Contents: Applied Statistics Project
# Import all the libraries needed to complete the analysis, visualization, modeling and presentation
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
import matplotlib.patches as mpatches
The Background:
Contingency tables classify outcomes in rows and columns. Table cells at the intersections of rows and columns indicate frequencies of both events coinciding.
Consequently, to calculate joint probabilities in a contingency table, take each cell count and divide by the grand total.
In contingency tables, you can locate the marginal probabilities in the row and column totals. Statisticians refer to them as marginal probabilities because you find them in the margins of contingency tables!
Fortunately, using contingency tables to calculate conditional probabilities is straightforward. It’s merely a matter of dividing a cell value by a row or column total.
Probability of A given B is written as P(A | B).
$P(A\mid B) = \frac{P(A \text{ and } B)}{P(B)}$
where P(A and B) = Joint probability of A and B
P(A) = Marginal probability of A
P(B) = Marginal probability of B
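These definitions can be sketched with a small pandas contingency table. The 400 planned-and-purchased count and the 500/2000 totals match the worked example below; the remaining cell counts (100, 200, 1300) are illustrative assumptions chosen only to make the table add up.

```python
import pandas as pd

# Contingency table of counts: rows = planned to purchase, columns = actual purchase.
# Only the 400 cell and the 500/2000 totals come from the example; the
# other cells (100, 200, 1300) are made-up illustrative values.
table = pd.DataFrame(
    {"Purchased": [400, 200], "DidNotPurchase": [100, 1300]},
    index=["Planned", "DidNotPlan"],
)
grand_total = table.values.sum()                       # 2000

joint = table / grand_total                            # joint probabilities
p_planned = table.loc["Planned"].sum() / grand_total   # marginal P(Planned)
p_purchased_given_planned = (                          # conditional P(Purchased | Planned)
    table.loc["Planned", "Purchased"] / table.loc["Planned"].sum()
)

print(joint.loc["Planned", "Purchased"])   # 0.2
print(p_planned)                           # 0.25
print(p_purchased_given_planned)           # 0.8
```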
# The joint probability of the people who planned to purchase and actually placed an order
P1 = 400/2000
P1
0.2
# The conditional probability that a person actually placed an order, given that
# the person planned to purchase.
P2 = 400/500
P2
0.8
Binomial Distribution
The binomial distribution is a widely used probability distribution of a discrete random variable.
It plays a major role in quality control and quality assurance functions.
$P(X = x\mid n,\pi)$ = $\frac{n!}{x!(n - x)!}\pi^x (1 - \pi)^{n-x} $
where n is the number of trials, x is the number of events of interest, and $\pi$ is the probability of an event of interest
Some important functions in Python for Binomial distribution:
scipy.stats.binom.pmf gives the probability mass function for the binomial distribution: binomial = scipy.stats.binom.pmf(k, n, p), where k is an array taking values in {0, 1, ..., n}, and n and p are the shape parameters of the binomial distribution. The output, binomial, is an array of binomial probabilities.
cumbinomial = scipy.stats.binom.cdf(k, n, p) gives the cumulative binomial distribution. The output, cumbinomial, is an array of cumulative binomial probabilities.
The function matplotlib.pyplot.plot(k, binomial, 'o-') plots the binomial distribution function.
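As a quick sanity check on these functions, here is a minimal sketch using arbitrary example parameters (n = 10, p = 0.5 are placeholders, not values from the exercise below):

```python
import numpy as np
from scipy import stats

n, p = 10, 0.5                  # arbitrary example shape parameters
k = np.arange(n + 1)            # k takes values in {0, 1, ..., n}

binomial = stats.binom.pmf(k, n, p)      # probability mass function
cumbinomial = stats.binom.cdf(k, n, p)   # cumulative distribution function

# The pmf values sum to 1, and the cdf ends at 1
print(binomial.sum(), cumbinomial[-1])
```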
# The failure rate for the manufactured item is 5%
p = 0.05
q = 1 - p
n = 10
k=np.arange(0,8)
k
array([0, 1, 2, 3, 4, 5, 6, 7])
# Considering k = 0
# NoneDefective = P(0)
NoneDefective = stats.binom.pmf(0,n,p)
NoneDefective
0.5987369392383787
# Considering k = 1
# OneDefective = P(1)
OneDefective = stats.binom.pmf(1,n,p)
OneDefective
0.3151247048623047
# Considering k = 2
TwoDefective = stats.binom.pmf(2,n,p)
TwoDefective
0.07463479852001963
# Two or fewer of the manufactured items are defective = P(0) + P(1) + P(2)
TwoOrFewerDefective = NoneDefective + OneDefective + TwoDefective
TwoOrFewerDefective
0.988496442620703
# By the complement rule (probabilities sum to 1), P(X >= 3) = 1 - P(X <= 2)
ThreeOrMoreDefective = 1 - TwoOrFewerDefective
ThreeOrMoreDefective
0.011503557379297
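For tail probabilities like these, scipy's cumulative functions offer a shortcut; a minimal sketch with the same parameters (n = 10, p = 0.05):

```python
from scipy import stats

n, p = 10, 0.05   # same failure-rate parameters as above

two_or_fewer = stats.binom.cdf(2, n, p)    # P(X <= 2) directly from the CDF
three_or_more = stats.binom.sf(2, n, p)    # P(X >= 3) via the survival function, 1 - CDF

print(two_or_fewer)    # ~0.9885
print(three_or_more)   # ~0.0115
```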
Poisson Distribution
This is a discrete distribution that also plays a major role in quality control.
The Poisson distribution is a discrete probability distribution for the counts of events that occur randomly in a given interval of time or space. In such areas of opportunity, there can be more than one occurrence. In such situations, Poisson distribution can be used to compute probabilities.
Examples include the number of defects per item and the number of defects per transformer produced. Note: the Poisson distribution helps to predict the arrival rate in a waiting-line situation where a queue forms and people wait to be served, and the service rate is generally higher than the arrival rate.
Properties:
Mean μ = λ
Standard deviation σ = √λ
The Poisson distribution is the limit of the binomial distribution as n approaches ∞ and p approaches 0, with np held fixed at λ.
$P(X = x) = \frac{e^{-\lambda} \lambda^x}{x!}$ where $\lambda$ is the expected number of events per area of opportunity and e is Euler's number (≈ 2.71828)
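The limiting relationship can be checked numerically. The sketch below compares a binomial pmf with p = λ/n to the Poisson pmf as n grows; λ = 3 is chosen to mirror the car-sales example that follows.

```python
import numpy as np
from scipy import stats

lam = 3.0
k = np.arange(10)

# As n grows with p = lam/n (so np stays fixed at lam),
# Binomial(n, p) converges to Poisson(lam)
for n in (10, 100, 10_000):
    binom_pmf = stats.binom.pmf(k, n, lam / n)
    poisson_pmf = stats.poisson.pmf(k, lam)
    print(n, np.abs(binom_pmf - poisson_pmf).max())  # max error shrinks with n
```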
Applications
Car Accidents
Number of deaths by horse kicking in Prussian Army (first application)
Birth defects and genetic mutation
Note
If there is a fixed number of observations, n, each of which is classified as an event of interest or not an event of interest, use the binomial distribution.
If there is an area of opportunity, use the Poisson distribution.
# Average rate is denoted with lambda in the formula for poisson distribution
rate = 3
# Using a numpy array to save different no. of successes ranging from 0 to 19 to construct a probability distribution
n=np.arange(0,20)
# Calculating the distribution and storing the distribution of probablitites in an array
PoissonPmf = stats.poisson.pmf(n,rate)
# By the complement rule (probabilities sum to 1), P(X >= 1) = 1 - P(X = 0)
SellSomeCars = 1 - PoissonPmf[0]
SellSomeCars
0.950212931632136
# Sell two to five cars (inclusive): P(2) + P(3) + P(4) + P(5)
SellTwoToFive = PoissonPmf[2]+PoissonPmf[3]+PoissonPmf[4]+PoissonPmf[5]
SellTwoToFive
0.7169337844972408
# Now we can create an array with Poisson cumulative probability values:
from scipy.stats import poisson
cdf = poisson.cdf(n, mu=3)
cdf = np.round(cdf, 3)
print(cdf)
[0.05 0.199 0.423 0.647 0.815 0.916 0.966 0.988 0.996 0.999 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. ]
# If you want to print it in a nicer way with each 'n' value and the corresponding cumulative probability:
for val, prob in zip(n, cdf):
    print(f"n-value {val} has probability = {prob}")
n-value 0 has probability = 0.05
n-value 1 has probability = 0.199
n-value 2 has probability = 0.423
n-value 3 has probability = 0.647
n-value 4 has probability = 0.815
n-value 5 has probability = 0.916
n-value 6 has probability = 0.966
n-value 7 has probability = 0.988
n-value 8 has probability = 0.996
n-value 9 has probability = 0.999
n-value 10 has probability = 1.0
n-value 11 has probability = 1.0
n-value 12 has probability = 1.0
n-value 13 has probability = 1.0
n-value 14 has probability = 1.0
n-value 15 has probability = 1.0
n-value 16 has probability = 1.0
n-value 17 has probability = 1.0
n-value 18 has probability = 1.0
n-value 19 has probability = 1.0
# Using matplotlib library, we can easily plot the Poisson CDF using Python:
plt.plot(n, cdf, marker='o')
plt.title(r'Poisson: $\lambda$ = %i ' % rate)
plt.xlabel('Number of Cars Sold per Week')
plt.ylabel('Cumulative Probability of Number of Cars Sold per Week')
plt.show()
# Printing the probability distribution for different values of x
PoissonPmf
array([4.97870684e-02, 1.49361205e-01, 2.24041808e-01, 2.24041808e-01,
1.68031356e-01, 1.00818813e-01, 5.04094067e-02, 2.16040315e-02,
8.10151179e-03, 2.70050393e-03, 8.10151179e-04, 2.20950322e-04,
5.52375804e-05, 1.27471339e-05, 2.73152870e-06, 5.46305740e-07,
1.02432326e-07, 1.80762929e-08, 3.01271548e-09, 4.75691918e-10])
# Using matplotlib library, we can easily plot the Poisson PMF using Python:
plt.plot(n,PoissonPmf,'o-')
plt.title(r'Poisson: $\lambda$ = %i ' % rate)
plt.xlabel('Number of Cars Sold per Week')
plt.ylabel('Probability of Number of Cars Sold per Week')
plt.show()
Since there are three orders and the probability of a correct order is 0.868, we can use the binomial distribution equation:
$P(X = 3 \mid n = 3, \pi = 0.868) = \frac{3!}{3!(3-3)!}(0.868)^3(1-0.868)^{3-3} = 0.6540$
Likewise, calculating for X = 0 and X = 2:
P(X = 0) = 0.0023
P(X = 2) = 0.2984
Hence, P(X>=2) = P(X=2)+P(X=3) = 0.9524
# Since there are only 2 outcomes (an order is either filled correctly or not filled correctly)
# and the number of trials is 3, we can use the Binomial distribution with the following parameters:
p1 = 0.868
q1 = 1 - p1
n1 = 3
ThreeOrdersRecognisedCorrect = stats.binom.pmf(3,3,p1)
ThreeOrdersRecognisedCorrect
0.653972032
NoneThreeOrdersRecognisedCorrect = stats.binom.pmf(0,3,p1)
NoneThreeOrdersRecognisedCorrect
0.0022999680000000003
# We need P(at least 2 of 3 correct) = 1 - P(at most 1 correct)
TwoOrMoreOfThreeRecognisedCorrect = 1 - stats.binom.cdf(1,3,p1)
TwoOrMoreOfThreeRecognisedCorrect
0.952327936
We can use statistics to evaluate potential new versions of a children’s dry cereal. We can use taste tests to provide valuable statistical information on what customers want from a product. The four key factors that product developers may consider to enhance the taste of the cereal are the ratio of wheat to corn in the cereal flake, the type of sweetener, the presence of flavour bits, and the cooking time.
We should design an experiment to determine what effects these four factors had on cereal taste. For example, one test cereal can be made with a specified ratio of wheat to corn, sugar as the sweetener, flavour bits, and a short cooking time; another test cereal can be made with a different ratio of wheat to corn and the other three factors the same, and so on. Groups of children then taste-test the cereals and state what they think about the taste of each.
The Analysis of variance (ANOVA) is the statistical method we can use to study the data obtained from the taste tests. The results of the analysis may show the following:
This information can be vital to identify the factors that would lead to the best-tasting cereal. The same information can be used by the marketing and manufacturing teams, and for a better product development strategy.
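A one-way version of such an analysis can be sketched with scipy.stats.f_oneway. The taste ratings below are made-up numbers purely to illustrate the mechanics of ANOVA, not data from the study described.

```python
from scipy import stats

# Hypothetical taste-test ratings (1-10) for three sweetener options;
# these values are illustrative assumptions, not real study data.
sugar    = [7, 8, 6, 9, 8]
fructose = [5, 6, 5, 7, 6]
honey    = [8, 7, 9, 8, 9]

# One-way ANOVA: do the group means differ more than chance would explain?
f_stat, p_value = stats.f_oneway(sugar, fructose, honey)
print(f_stat, p_value)
```

A small p-value (commonly below 0.05) suggests that at least one sweetener's mean rating differs from the others; a full factorial study of all four factors would use a multi-way ANOVA instead.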
Tools to be used: Python, R, Minitab, Excel, MS SQL Server
The dataset contains information on all the teams that have participated in all the past tournaments. It has data about how many baskets each team scored and conceded, how many times they came within the first 2 positions, how many tournaments they have qualified for, their best position in the past, etc.
Team: Team’s name
Tournament: Number of tournaments played.
Score: Team’s score so far.
PlayedGames: Games played by the team so far.
WonGames: Games won by the team so far.
DrawnGames: Games drawn by the team so far.
LostGames: Games lost by the team so far.
BasketScored: Baskets scored by the team so far.
BasketGiven: Baskets scored against the team so far.
TournamentChampion: How many times the team was champion of the tournaments so far.
Runner-up: How many times the team was runner-up of the tournaments so far.
TeamLaunch: Year the team was launched into professional basketball.
HighestPositionHeld: Highest position held by the team amongst all the tournaments played.
The analytics department has been assigned the task of creating a report on the performance shown by the teams in the league. Some of the older teams are already in contract with competitors. Hence Company X wants to understand which teams it can approach that would be a deal win for them. ___
# Import all the libraries needed to complete the analysis, visualization, modeling and presentation
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
import matplotlib.patches as mpatches
# Configure for any default setting of any library
%matplotlib inline
sns.set(style='darkgrid', palette='deep', font='sans-serif', font_scale=1.3, color_codes=True)
Some Comments about the Libraries:
%matplotlib inline sets the backend of matplotlib to the 'inline' backend: with this backend, the output of plotting commands is displayed inline without needing to call plt.show() every time data is plotted.
basketball = pd.read_csv("Basketball.csv")
#There are 61 Observations / Rows and 13 Attributes / Columns.
basketball.shape
(61, 13)
basketball.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61 entries, 0 to 60
Data columns (total 13 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   Team                 61 non-null     object
 1   Tournament           61 non-null     int64
 2   Score                61 non-null     object
 3   PlayedGames          61 non-null     object
 4   WonGames             61 non-null     object
 5   DrawnGames           61 non-null     object
 6   LostGames            61 non-null     object
 7   BasketScored         61 non-null     object
 8   BasketGiven          61 non-null     object
 9   TournamentChampion   61 non-null     object
 10  Runner-up            61 non-null     object
 11  TeamLaunch           61 non-null     object
 12  HighestPositionHeld  61 non-null     int64
dtypes: int64(2), object(11)
memory usage: 6.3+ KB
# Check the head of the dataset
basketball.head()
| Team | Tournament | Score | PlayedGames | WonGames | DrawnGames | LostGames | BasketScored | BasketGiven | TournamentChampion | Runner-up | TeamLaunch | HighestPositionHeld | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Team 1 | 86 | 4385 | 2762 | 1647 | 552 | 563 | 5947 | 3140 | 33 | 23 | 1929 | 1 |
| 1 | Team 2 | 86 | 4262 | 2762 | 1581 | 573 | 608 | 5900 | 3114 | 25 | 25 | 1929 | 1 |
| 2 | Team 3 | 80 | 3442 | 2614 | 1241 | 598 | 775 | 4534 | 3309 | 10 | 8 | 1929 | 1 |
| 3 | Team 4 | 82 | 3386 | 2664 | 1187 | 616 | 861 | 4398 | 3469 | 6 | 6 | 1931to32 | 1 |
| 4 | Team 5 | 86 | 3368 | 2762 | 1209 | 633 | 920 | 4631 | 3700 | 8 | 7 | 1929 | 1 |
# Check the tail of the dataset
basketball.tail()
| Team | Tournament | Score | PlayedGames | WonGames | DrawnGames | LostGames | BasketScored | BasketGiven | TournamentChampion | Runner-up | TeamLaunch | HighestPositionHeld | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 56 | Team 57 | 1 | 34 | 38 | 8 | 10 | 20 | 38 | 66 | - | - | 2009-10 | 20 |
| 57 | Team 58 | 1 | 22 | 30 | 7 | 8 | 15 | 37 | 57 | - | - | 1956-57 | 16 |
| 58 | Team 59 | 1 | 19 | 30 | 7 | 5 | 18 | 51 | 85 | - | - | 1951~52 | 16 |
| 59 | Team 60 | 1 | 14 | 30 | 5 | 4 | 21 | 34 | 65 | - | - | 1955-56 | 15 |
| 60 | Team 61 | 1 | - | - | - | - | - | - | - | - | - | 2017~18 | 9 |
# Check for null values
# No Null values in data
basketball.isna().sum()
Team                   0
Tournament             0
Score                  0
PlayedGames            0
WonGames               0
DrawnGames             0
LostGames              0
BasketScored           0
BasketGiven            0
TournamentChampion     0
Runner-up              0
TeamLaunch             0
HighestPositionHeld    0
dtype: int64
# Check for duplicates in data
# We do not have duplicates
dupes = basketball.duplicated()
sum(dupes)
0
# Generate descriptive statistics.
# Descriptive statistics include those that summarize the central
# tendency, dispersion and shape of a
# dataset's distribution, excluding ``NaN`` values.
# Analyzes both numeric and object series, as well
# as ``DataFrame`` column sets of mixed data types. The output
# will vary depending on what is provided.
basketball.describe(include="all")
| Team | Tournament | Score | PlayedGames | WonGames | DrawnGames | LostGames | BasketScored | BasketGiven | TournamentChampion | Runner-up | TeamLaunch | HighestPositionHeld | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 61 | 61.000000 | 61 | 61 | 61 | 61 | 61 | 61 | 61 | 61 | 61 | 61 | 61.000000 |
| unique | 61 | NaN | 61 | 53 | 59 | 57 | 56 | 60 | 61 | 8 | 10 | 47 | NaN |
| top | Team 1 | NaN | 4385 | 2762 | 7 | 14 | 37 | 70 | 3140 | - | - | 1929 | NaN |
| freq | 1 | NaN | 1 | 3 | 2 | 2 | 3 | 2 | 1 | 52 | 48 | 10 | NaN |
| mean | NaN | 24.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 7.081967 |
| std | NaN | 26.827225 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 5.276663 |
| min | NaN | 1.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.000000 |
| 25% | NaN | 4.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 3.000000 |
| 50% | NaN | 12.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 6.000000 |
| 75% | NaN | 38.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 10.000000 |
| max | NaN | 86.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 20.000000 |
Some Insights:
describe() shows NaN for most summary statistics because most columns are stored as object (text) dtype; together with the '-' placeholders visible in the head and tail views, this suggests there are hidden missing values.
The input formats differ across the TeamLaunch, TournamentChampion, and Runner-up columns, so let's inspect the data in these 3 columns.
Also, let's make the column names consistent: Runner-up -> RunnerUp
# Rename the columns for consistency
basketball.rename(columns = {'Runner-up': 'RunnerUp'}, inplace = True)
# Check for unique values in various columns
print("TournamentChampion",basketball['TournamentChampion'].unique())
print("RunnerUp",basketball['RunnerUp'].unique())
print("TeamLaunch",basketball['TeamLaunch'].unique())
print("Team",basketball['Team'].unique())
print("Score",basketball['Score'].unique())
TournamentChampion ['33' '25' '10' '6' '8' '1' '-' '2']
RunnerUp ['23' '25' '8' '6' '7' '4' '-' '3' '1' '5']
TeamLaunch ['1929' '1931to32' '1934-35' '1939-40' '1932-33' '1941to42' '1948-49'
 '1944_45' '1935-36' '1949_50' '1933to34' '1960-61' '1951-52' '1998-99'
 '1941-42' '1977-78' '1959-60' '2004to05' '1961-62' '1940-41' '1930-31'
 '1963-64' '1974-75' '1943-44' '1987-88' '1991_92' '2007-08' '1962-63'
 '1994-95' '1978-79' '1971-72' '1999to00' '2014-15' '1990-91' '1947-48'
 '1996-97' '1995-96' '1945-46' '1953-54' '1979-80' '1950-51' '2016_17'
 '2009-10' '1956-57' '1951~52' '1955-56' '2017~18']
Team ['Team 1' 'Team 2' 'Team 3' 'Team 4' 'Team 5' 'Team 6' 'Team 7' 'Team 8'
 'Team 9' 'Team 10' 'Team 11' 'Team 12' 'Team 13' 'Team 14' 'Team 15'
 'Team 16' 'Team 17' 'Team 18' 'Team 19' 'Team 20' 'Team 21' 'Team 22'
 'Team 23' 'Team 24' 'Team 25' 'Team 26' 'Team 27' 'Team 28' 'Team 29'
 'Team 30' 'Team 31' 'Team 32' 'Team 33' 'Team 34' 'Team 35' 'Team 36'
 'Team 37' 'Team 38' 'Team 39' 'Team 40' 'Team 41' 'Team 42' 'Team 43'
 'Team 44' 'Team 45' 'Team 46' 'Team 47' 'Team 48' 'Team 49' 'Team 50'
 'Team 51' 'Team 52' 'Team 53' 'Team 54' 'Team 55' 'Team 56' 'Team 57'
 'Team 58' 'Team 59' 'Team 60' 'Team 61']
Score ['4385' '4262' '3442' '3386' '3368' '2819' '2792' '2573' '2109' '1884'
 '1814' '1789' '1471' '1416' '1389' '1351' '1314' '1174' '1148' '1020'
 '970' '667' '662' '606' '553' '538' '510' '445' '421' '416' '375' '353'
 '343' '293' '285' '277' '242' '230' '190' '188' '168' '150' '148' '132'
 '107' '96' '91' '83' '81' '76' '71' '56' '52' '42' '40' '35' '34' '22'
 '19' '14' '-']
Some Observations:
# Make a copy of the original dataset
basketballOriginal=basketball.copy(deep=True);
# For TeamLaunch year, we can consider just the initial year for easy data cleaning and analysis.
basketball['TeamLaunch'] = basketballOriginal['TeamLaunch'].str.slice(0,4);
# We can fill 0 in place of the missing-value marker '-':
# if a team's HighestPositionHeld was never 1 or 2, it was never a champion or
# runner-up, so its TournamentChampion and Runner-up fields are genuinely zero.
basketball=basketball.replace('-',0);
# For easy data handling and visualization, we can rename the Team as T in the 'Team' Column.
basketball['Team']=basketball['Team'].str.replace('Team ','T');
columns = basketball.columns.drop('Team');
# Data types of most of the columns are Objects;
# As values are in integers, it would be beneficial to do the type conversion here.
basketball[columns] = basketball[columns].apply(pd.to_numeric, errors='coerce')
# Team 61 seems to have no valid data: It participated in 1 tournament and got 9th position.
# No information about games or baskets. We can safely drop it.
basketball.drop(60,inplace=True)
# Check the head of the dataset after cleaning
basketball.head(5)
| Team | Tournament | Score | PlayedGames | WonGames | DrawnGames | LostGames | BasketScored | BasketGiven | TournamentChampion | RunnerUp | TeamLaunch | HighestPositionHeld | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | T1 | 86 | 4385 | 2762 | 1647 | 552 | 563 | 5947 | 3140 | 33 | 23 | 1929 | 1 |
| 1 | T2 | 86 | 4262 | 2762 | 1581 | 573 | 608 | 5900 | 3114 | 25 | 25 | 1929 | 1 |
| 2 | T3 | 80 | 3442 | 2614 | 1241 | 598 | 775 | 4534 | 3309 | 10 | 8 | 1929 | 1 |
| 3 | T4 | 82 | 3386 | 2664 | 1187 | 616 | 861 | 4398 | 3469 | 6 | 6 | 1931 | 1 |
| 4 | T5 | 86 | 3368 | 2762 | 1209 | 633 | 920 | 4631 | 3700 | 8 | 7 | 1929 | 1 |
# Check the tail of the dataset after cleaning
basketball.tail(5)
| Team | Tournament | Score | PlayedGames | WonGames | DrawnGames | LostGames | BasketScored | BasketGiven | TournamentChampion | RunnerUp | TeamLaunch | HighestPositionHeld | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 55 | T56 | 1 | 35 | 38 | 8 | 11 | 19 | 36 | 55 | 0 | 0 | 2016 | 17 |
| 56 | T57 | 1 | 34 | 38 | 8 | 10 | 20 | 38 | 66 | 0 | 0 | 2009 | 20 |
| 57 | T58 | 1 | 22 | 30 | 7 | 8 | 15 | 37 | 57 | 0 | 0 | 1956 | 16 |
| 58 | T59 | 1 | 19 | 30 | 7 | 5 | 18 | 51 | 85 | 0 | 0 | 1951 | 16 |
| 59 | T60 | 1 | 14 | 30 | 5 | 4 | 21 | 34 | 65 | 0 | 0 | 1955 | 15 |
basketball.describe(include="all")
| Team | Tournament | Score | PlayedGames | WonGames | DrawnGames | LostGames | BasketScored | BasketGiven | TournamentChampion | RunnerUp | TeamLaunch | HighestPositionHeld | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 60 | 60.000000 | 60.000000 | 60.000000 | 60.000000 | 60.000000 | 60.000000 | 60.000000 | 60.000000 | 60.00000 | 60.000000 | 60.000000 | 60.000000 |
| unique | 60 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| top | T1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| freq | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| mean | NaN | 24.383333 | 916.450000 | 810.100000 | 309.033333 | 192.083333 | 308.816667 | 1159.350000 | 1159.233333 | 1.45000 | 1.433333 | 1957.950000 | 7.050000 |
| std | NaN | 26.884620 | 1138.342899 | 877.465393 | 408.481395 | 201.985508 | 294.508639 | 1512.063948 | 1163.946914 | 5.51554 | 4.574679 | 26.646732 | 5.315232 |
| min | NaN | 1.000000 | 14.000000 | 30.000000 | 5.000000 | 4.000000 | 15.000000 | 34.000000 | 55.000000 | 0.00000 | 0.000000 | 1929.000000 | 1.000000 |
| 25% | NaN | 4.000000 | 104.250000 | 115.500000 | 34.750000 | 26.250000 | 62.750000 | 154.500000 | 236.000000 | 0.00000 | 0.000000 | 1934.750000 | 3.000000 |
| 50% | NaN | 12.000000 | 395.500000 | 424.500000 | 124.000000 | 98.500000 | 197.500000 | 444.000000 | 632.500000 | 0.00000 | 0.000000 | 1950.500000 | 6.000000 |
| 75% | NaN | 39.000000 | 1360.500000 | 1345.500000 | 432.750000 | 331.500000 | 563.500000 | 1669.750000 | 2001.250000 | 0.00000 | 0.000000 | 1977.250000 | 10.000000 |
| max | NaN | 86.000000 | 4385.000000 | 2762.000000 | 1647.000000 | 633.000000 | 1070.000000 | 5947.000000 | 3889.000000 | 33.00000 | 25.000000 | 2016.000000 | 20.000000 |
Univariate analysis refers to the analysis of a single variable. Its main purpose is to summarize the variable and find patterns in the data. The key point is that only one variable is involved in the analysis.
# Print the mean of each attribute. Ignore "Team" as it is not a continuous variable
print("basketball:",basketball.mean())
basketball:
Tournament               24.383333
Score                   916.450000
PlayedGames             810.100000
WonGames                309.033333
DrawnGames              192.083333
LostGames               308.816667
BasketScored           1159.350000
BasketGiven            1159.233333
TournamentChampion        1.450000
RunnerUp                  1.433333
TeamLaunch             1957.950000
HighestPositionHeld       7.050000
dtype: float64
# Print the median values of the basketball.
# Observe that the values of mean and median are not the same for most of the attributes.
print("basketball:",basketball.median())
basketball:
Tournament               12.0
Score                   395.5
PlayedGames             424.5
WonGames                124.0
DrawnGames               98.5
LostGames               197.5
BasketScored            444.0
BasketGiven             632.5
TournamentChampion        0.0
RunnerUp                  0.0
TeamLaunch             1950.5
HighestPositionHeld       6.0
dtype: float64
# Prints the mode of the attribute. The Attribute is unimodal
print(basketball['HighestPositionHeld'].mode())
0    1
dtype: int64
# Prints the value below which 25% of the data lies
print("Data_quantile(25%):",basketball.quantile(q=0.25))
Data_quantile(25%):
Tournament                4.00
Score                   104.25
PlayedGames             115.50
WonGames                 34.75
DrawnGames               26.25
LostGames                62.75
BasketScored            154.50
BasketGiven             236.00
TournamentChampion        0.00
RunnerUp                  0.00
TeamLaunch             1934.75
HighestPositionHeld       3.00
Name: 0.25, dtype: float64
# Prints the value below which 50% of the data lies
print("Data_quantile(50%):",basketball.quantile(q=0.50))
Data_quantile(50%):
Tournament               12.0
Score                   395.5
PlayedGames             424.5
WonGames                124.0
DrawnGames               98.5
LostGames               197.5
BasketScored            444.0
BasketGiven             632.5
TournamentChampion        0.0
RunnerUp                  0.0
TeamLaunch             1950.5
HighestPositionHeld       6.0
Name: 0.5, dtype: float64
# Prints the value below which 75% of the data lies
print("Data_quantile(75%):",basketball.quantile(q=0.75))
Data_quantile(75%):
Tournament               39.00
Score                  1360.50
PlayedGames            1345.50
WonGames                432.75
DrawnGames              331.50
LostGames               563.50
BasketScored           1669.75
BasketGiven            2001.25
TournamentChampion        0.00
RunnerUp                  0.00
TeamLaunch             1977.25
HighestPositionHeld      10.00
Name: 0.75, dtype: float64
# The below output represents the IQR values for all the attributes
basketball.quantile(0.75) - basketball.quantile(0.25)
Tournament               35.00
Score                  1256.25
PlayedGames            1230.00
WonGames                398.00
DrawnGames              305.25
LostGames               500.75
BasketScored           1515.25
BasketGiven            1765.25
TournamentChampion        0.00
RunnerUp                  0.00
TeamLaunch               42.50
HighestPositionHeld       7.00
dtype: float64
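The 1.5 × IQR fences that the box plots further below rely on can be computed explicitly from these quartiles. The sketch uses a small made-up series of score-like values in place of basketball['Score'] so that it runs stand-alone.

```python
import pandas as pd

# Tukey's rule, the same one box-plot whiskers use: values outside
# [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged as outliers.
# The series below is illustrative, not the real Score column.
scores = pd.Series([14, 22, 104, 396, 916, 1360, 2100, 4385])

q1, q3 = scores.quantile(0.25), scores.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = scores[(scores < lower) | (scores > upper)]
print(lower, upper, list(outliers))
```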
# Range: The difference between the highest value and lowest values for all individual attributes
# columns = basketball.columns.drop('Team')
print(basketball[columns].max() - basketball[columns].min())
Tournament               85
Score                  4371
PlayedGames            2732
WonGames               1642
DrawnGames              629
LostGames              1055
BasketScored           5913
BasketGiven            3834
TournamentChampion       33
RunnerUp                 25
TeamLaunch               87
HighestPositionHeld      19
dtype: int64
# The output below shows the variance: a measure of how dispersed the data is
print(basketball.var())
Tournament             7.227828e+02
Score                  1.295825e+06
PlayedGames            7.699455e+05
WonGames               1.668570e+05
DrawnGames             4.079815e+04
LostGames              8.673534e+04
BasketScored           2.286337e+06
BasketGiven            1.354772e+06
TournamentChampion     3.042119e+01
RunnerUp               2.092768e+01
TeamLaunch             7.100483e+02
HighestPositionHeld    2.825169e+01
dtype: float64
# The output below shows the standard deviation: how much the data deviates from the mean
print(basketball.std())
Tournament               26.884620
Score                  1138.342899
PlayedGames             877.465393
WonGames                408.481395
DrawnGames              201.985508
LostGames               294.508639
BasketScored           1512.063948
BasketGiven            1163.946914
TournamentChampion        5.515540
RunnerUp                  4.574679
TeamLaunch               26.646732
HighestPositionHeld       5.315232
dtype: float64
# Understand the Skewness of the data
# Positively skewed: Most frequent values are low and tail is towards the high values.
basketball.skew()
Tournament             1.197176
Score                  1.574104
PlayedGames            1.123454
WonGames               1.786067
DrawnGames             0.984899
LostGames              0.880596
BasketScored           1.758058
BasketGiven            0.958164
TournamentChampion     4.734845
RunnerUp               4.321794
TeamLaunch             0.682570
HighestPositionHeld    0.832164
dtype: float64
# Understand the Kurtosis of the data
basketball.kurtosis()
Tournament              0.176135
Score                   1.686503
PlayedGames            -0.030398
WonGames                2.577189
DrawnGames             -0.422334
LostGames              -0.456038
BasketScored            2.406123
BasketGiven            -0.451057
TournamentChampion     23.326192
RunnerUp               19.482117
TeamLaunch             -0.787472
HighestPositionHeld    -0.284934
dtype: float64
# Check for distribution, skewness
# This function combines the matplotlib hist function (with automatic calculation of a
# good default bin size) with the seaborn kdeplot() and rugplot() functions. It can
# also fit scipy.stats distributions and plot the estimated PDF over the data.
select = ['Tournament','Score','PlayedGames','WonGames','DrawnGames','LostGames',
'BasketScored','BasketGiven','TournamentChampion','RunnerUp','TeamLaunch',
'HighestPositionHeld']
plt.figure(figsize=(20,20))
index = 1
for col in basketball[select]:
    plt.subplot(4,3,index)
    sns.distplot(basketball[col], rug=True, kde=True,
                 rug_kws={"color": "r"},
                 kde_kws={"color": "k"},
                 hist_kws={"color": "c"})
    index += 1
# Draw a box plot to show distributions with respect to categories.
# A box plot (or box-and-whisker plot) shows the distribution of quantitative
# data in a way that facilitates comparisons between variables or across
# levels of a categorical variable. The box shows the quartiles of the
# dataset while the whiskers extend to show the rest of the distribution,
# except for points that are determined to be "outliers" using a method
# that is a function of the inter-quartile range.
fig, ax = plt.subplots(3, 4)
fig.set_figheight(5)
fig.set_figwidth(15)
sns.boxplot(x=basketball['PlayedGames'],ax=ax[0][0]);
sns.boxplot(x=basketball['WonGames'],ax=ax[0][1]);
sns.boxplot(x=basketball['TournamentChampion'],ax=ax[0][2]);
sns.boxplot(x=basketball['RunnerUp'],ax=ax[0][3]);
sns.boxplot(x=basketball['TeamLaunch'],ax=ax[1][0]);
sns.boxplot(x=basketball['HighestPositionHeld'],ax=ax[1][1]);
sns.boxplot(x=basketball['BasketScored'],ax=ax[1][2]);
sns.boxplot(x=basketball['BasketGiven'],ax=ax[1][3]);
sns.boxplot(x=basketball['Tournament'],ax=ax[2][0]);
sns.boxplot(x=basketball['Score'],ax=ax[2][1]);
sns.boxplot(x=basketball['DrawnGames'],ax=ax[2][2]);
sns.boxplot(x=basketball['LostGames'],ax=ax[2][3]);
fig.tight_layout()
plt.show()
# Consider the distribution of all the quantitative variables
plt.figure(figsize=(10,5))
columns = basketball.columns.drop('Team');
basketballColumns = basketball[columns];
sns.distplot(basketballColumns);
plt.show()
# Cumulative distribution for all the quantitative variables
plt.figure(figsize=(10,5))
sns.distplot(basketballColumns, hist_kws=dict(cumulative=True), kde_kws=dict(cumulative=True));
plt.show()
Important Insights:
# The following code plots a histogram using the matplotlib package.
# The bins argument creates class intervals. In this case we are creating 50 such intervals.
# In the output below, the first array is the frequency in each class and the second array
# contains the edges of the class intervals. These arrays can be assigned to variables and
# used for further analysis.
plt.hist(basketball['Score'], bins=50)
(array([15., 6., 3., 5., 4., 2., 2., 2., 0., 0., 1., 1., 1.,
1., 1., 2., 2., 0., 0., 0., 2., 1., 0., 1., 0., 0.,
0., 0., 0., 1., 0., 1., 1., 0., 0., 0., 0., 0., 2.,
1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1.]),
array([ 14. , 101.42, 188.84, 276.26, 363.68, 451.1 , 538.52,
625.94, 713.36, 800.78, 888.2 , 975.62, 1063.04, 1150.46,
1237.88, 1325.3 , 1412.72, 1500.14, 1587.56, 1674.98, 1762.4 ,
1849.82, 1937.24, 2024.66, 2112.08, 2199.5 , 2286.92, 2374.34,
2461.76, 2549.18, 2636.6 , 2724.02, 2811.44, 2898.86, 2986.28,
3073.7 , 3161.12, 3248.54, 3335.96, 3423.38, 3510.8 , 3598.22,
3685.64, 3773.06, 3860.48, 3947.9 , 4035.32, 4122.74, 4210.16,
4297.58, 4385. ]),
<BarContainer object of 50 artists>)
# Let's plot a frequency polygon superimposed on a histogram using the seaborn package.
# Seaborn automatically creates class intervals; the number of bins can also be set manually.
sns.distplot(basketball['Score'])
<AxesSubplot:xlabel='Score', ylabel='Density'>
# Let's add an argument to plot only the frequency polygon
sns.distplot(basketball['Score'], hist=False)
<AxesSubplot:xlabel='Score', ylabel='Density'>
# Let's plot a violin plot using the seaborn package.
# This distribution can also be visualised in another manner.
# For this we can use the violin plot function from seaborn.
# The violin plot shows a vertical mirror image of the distribution
# along with the original distribution.
sns.violinplot(basketball['Score'])
<AxesSubplot:xlabel='Score'>
# Now let us have a closer look at the distribution by plotting a simple histogram with 10 bins.
plt.figure(figsize=(20,10)) # makes the plot wider
plt.hist(basketball['Score'], color='g') # plots a simple histogram
plt.axvline(basketball['Score'].mean(), color='r', linewidth=2, label='Mean')
plt.axvline(basketball['Score'].median(), color='b', linestyle='dashed', linewidth=2, label='Median')
plt.axvline(basketball['Score'].mode()[0], color='k', linestyle='dashed', linewidth=2, label='Mode')
plt.legend()
<matplotlib.legend.Legend at 0x21f39695670>
# Create boxplot for column="Score"
# Make a box plot from DataFrame columns.
# Make a box-and-whisker plot from DataFrame columns, optionally grouped
# by some other columns. A box plot is a method for graphically depicting
# groups of numerical data through their quartiles.
# The box extends from the Q1 to Q3 quartile values of the data,
# with a line at the median (Q2). The whiskers extend from the edges
# of box to show the range of the data. By default, they extend no more than
# `1.5 * IQR (IQR = Q3 - Q1)` from the edges of the box, ending at the farthest
# data point within that interval. Outliers are plotted as separate dots.
basketball.boxplot(column="Score",return_type='axes',figsize=(8,8))
# sns.boxplot(x=basketball['Score'])
<AxesSubplot:>
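The whisker rule quoted in the comments above can be checked by hand. A minimal sketch with a hypothetical sample (not the basketball scores):

```python
import numpy as np

# Hypothetical sample with one clear outlier
sample = np.array([10, 12, 13, 14, 15, 16, 18, 45], dtype=float)

q1, q3 = np.percentile(sample, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Points outside the whisker fences are drawn as individual dots
outliers = sample[(sample < lower) | (sample > upper)]
print(outliers)  # [45.]
```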
# Histogram for checking the Skewness
plt.figure(figsize=(10,5))
#convert pandas DataFrame object to numpy array and sort
h = np.asarray(basketball['Score'])
h = sorted(h)
#use the scipy stats module to fit a normal distribution with the same mean and standard deviation
fit = stats.norm.pdf(h, np.mean(h), np.std(h))
#plot both series on the histogram
plt.plot(h,fit,'-',linewidth = 2,label="Normal distribution with same mean and var")
plt.hist(h,density=True,bins = 100,label="Actual distribution")
plt.legend()
plt.show()
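Beyond eyeballing the histogram against the fitted normal, skewness can be quantified with `scipy.stats.skew`: values near 0 indicate symmetry, positive values a longer right tail. A small sketch with synthetic data (not the basketball scores):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
symmetric = rng.normal(50, 10, size=5000)       # roughly zero skew
right_skewed = rng.exponential(10, size=5000)   # long right tail, positive skew

print(round(stats.skew(symmetric), 2))     # near 0
print(round(stats.skew(right_skewed), 2))  # clearly positive
```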
Important Insights:
#Flexibly plot a univariate distribution of observations.
sns.distplot(basketball.TeamLaunch)
<AxesSubplot:xlabel='TeamLaunch', ylabel='Density'>
# We can categorize the variable TeamLaunch based on quantiles
# Quantile-based discretization function.
# Discretize variable into equal-sized buckets based on rank or based
# on sample quantiles. For example 1000 values for 10 quantiles would
# produce a Categorical object indicating quantile membership for each data point.
basketball['TeamLaunchCategory'] = (pd.qcut(basketball['TeamLaunch'], 4,
labels=['Very Old', 'Old', 'New', 'Very New']));
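To see how `pd.qcut` behaves, here is a minimal sketch with a hypothetical set of launch years; each of the four buckets ends up with the same number of rows:

```python
import pandas as pd

years = pd.Series([1929, 1935, 1950, 1960, 1975, 1988, 1995, 2004])
cats = pd.qcut(years, 4, labels=['Very Old', 'Old', 'New', 'Very New'])

# Each quartile bucket gets two of the eight values
print(cats.value_counts().to_dict())
```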
# Show the counts of observations in each categorical bin using bars.
# A count plot can be thought of as a histogram across a categorical, instead
# of quantitative, variable. The basic API and options are identical to those
# for :func:`barplot`, so you can compare counts across nested variables.
plt.figure(figsize=(25,8))
sns.countplot(x='TeamLaunch',hue='TeamLaunchCategory',data = basketball)
plt.title("Team Launch count along with Category",size=15)
plt.xticks(rotation=90)
plt.show()
Multivariate analysis is performed to understand the interactions among more than two variables in the dataset simultaneously.
Examples: pair plot, 3D scatter plot, etc.
# Prints the covariance of each attribute against every other attribute
basketball.cov()
| | Tournament | Score | PlayedGames | WonGames | DrawnGames | LostGames | BasketScored | BasketGiven | TournamentChampion | RunnerUp | TeamLaunch | HighestPositionHeld |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Tournament | 722.782768 | 3.003033e+04 | 2.355925e+04 | 10671.614124 | 5372.170904 | 7516.715537 | 3.962956e+04 | 3.090991e+04 | 87.383898 | 79.441243 | -429.234746 | -101.172034 |
| Score | 30030.333051 | 1.295825e+06 | 9.785666e+05 | 463704.696610 | 220292.385593 | 294612.626271 | 1.715454e+06 | 1.250509e+06 | 4489.268644 | 3982.835593 | -16335.672034 | -4049.124576 |
| PlayedGames | 23559.249153 | 9.785666e+05 | 7.699455e+05 | 346774.216949 | 176166.940678 | 247031.154237 | 1.286941e+06 | 1.011131e+06 | 2783.174576 | 2541.345763 | -13590.384746 | -3316.361017 |
| WonGames | 10671.614124 | 4.637047e+05 | 3.467742e+05 | 166857.049718 | 77508.658192 | 102428.853672 | 6.172252e+05 | 4.401460e+05 | 1696.289831 | 1490.917514 | -5882.320339 | -1406.527119 |
| DrawnGames | 5372.170904 | 2.202924e+05 | 1.761669e+05 | 77508.658192 | 40798.145480 | 57867.608757 | 2.877725e+05 | 2.333322e+05 | 560.792373 | 523.014124 | -3068.826271 | -773.258475 |
| LostGames | 7516.715537 | 2.946126e+05 | 2.470312e+05 | 102428.853672 | 57867.608757 | 86735.338701 | 3.820110e+05 | 3.376645e+05 | 526.338136 | 527.657062 | -4642.466949 | -1136.736441 |
| BasketScored | 39629.558475 | 1.715454e+06 | 1.286941e+06 | 617225.191525 | 287772.461864 | 382010.997458 | 2.286337e+06 | 1.638664e+06 | 6203.568644 | 5471.794915 | -22166.405932 | -5213.356780 |
| BasketGiven | 30909.909040 | 1.250509e+06 | 1.011131e+06 | 440145.958192 | 233332.166667 | 337664.535028 | 1.638664e+06 | 1.354772e+06 | 3027.689831 | 2840.439548 | -18885.327119 | -4499.689831 |
| TournamentChampion | 87.383898 | 4.489269e+03 | 2.783175e+03 | 1696.289831 | 560.792373 | 526.338136 | 6.203569e+03 | 3.027690e+03 | 30.421186 | 24.513559 | -42.146610 | -8.921186 |
| RunnerUp | 79.441243 | 3.982836e+03 | 2.541346e+03 | 1490.917514 | 523.014124 | 527.657062 | 5.471795e+03 | 2.840440e+03 | 24.513559 | 20.927684 | -38.672881 | -8.733898 |
| TeamLaunch | -429.234746 | -1.633567e+04 | -1.359038e+04 | -5882.320339 | -3068.826271 | -4642.466949 | -2.216641e+04 | -1.888533e+04 | -42.146610 | -38.672881 | 710.048305 | 84.934746 |
| HighestPositionHeld | -101.172034 | -4.049125e+03 | -3.316361e+03 | -1406.527119 | -773.258475 | -1136.736441 | -5.213357e+03 | -4.499690e+03 | -8.921186 | -8.733898 | 84.934746 | 28.251695 |
# Prints the correlation coefficient between every pair of attributes
basketball.corr()
| | Tournament | Score | PlayedGames | WonGames | DrawnGames | LostGames | BasketScored | BasketGiven | TournamentChampion | RunnerUp | TeamLaunch | HighestPositionHeld |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Tournament | 1.000000 | 0.981258 | 0.998683 | 0.971749 | 0.989295 | 0.949350 | 0.974867 | 0.987781 | 0.589304 | 0.645924 | -0.599166 | -0.708002 |
| Score | 0.981258 | 1.000000 | 0.979687 | 0.997232 | 0.958090 | 0.878780 | 0.996634 | 0.943801 | 0.715014 | 0.764819 | -0.538542 | -0.669215 |
| PlayedGames | 0.998683 | 0.979687 | 1.000000 | 0.967486 | 0.993972 | 0.955925 | 0.969970 | 0.990020 | 0.575072 | 0.633101 | -0.581243 | -0.711065 |
| WonGames | 0.971749 | 0.997232 | 0.967486 | 1.000000 | 0.939416 | 0.851436 | 0.999312 | 0.925745 | 0.752904 | 0.797849 | -0.540421 | -0.647819 |
| DrawnGames | 0.989295 | 0.958090 | 0.993972 | 0.939416 | 1.000000 | 0.972786 | 0.942234 | 0.992479 | 0.503378 | 0.566021 | -0.570175 | -0.720248 |
| LostGames | 0.949350 | 0.878780 | 0.955925 | 0.851436 | 0.972786 | 1.000000 | 0.857843 | 0.985041 | 0.324025 | 0.391645 | -0.591571 | -0.726172 |
| BasketScored | 0.974867 | 0.996634 | 0.969970 | 0.999312 | 0.942234 | 0.857843 | 1.000000 | 0.931079 | 0.743847 | 0.791041 | -0.550150 | -0.648672 |
| BasketGiven | 0.987781 | 0.943801 | 0.990020 | 0.925745 | 0.992479 | 0.985041 | 0.931079 | 1.000000 | 0.471618 | 0.533448 | -0.608902 | -0.727323 |
| TournamentChampion | 0.589304 | 0.715014 | 0.575072 | 0.752904 | 0.503378 | 0.324025 | 0.743847 | 0.471618 | 1.000000 | 0.971533 | -0.286768 | -0.304307 |
| RunnerUp | 0.645924 | 0.764819 | 0.633101 | 0.797849 | 0.566021 | 0.391645 | 0.791041 | 0.533448 | 0.971533 | 1.000000 | -0.317250 | -0.359191 |
| TeamLaunch | -0.599166 | -0.538542 | -0.581243 | -0.540421 | -0.570175 | -0.591571 | -0.550150 | -0.608902 | -0.286768 | -0.317250 | 1.000000 | 0.599679 |
| HighestPositionHeld | -0.708002 | -0.669215 | -0.711065 | -0.647819 | -0.720248 | -0.726172 | -0.648672 | -0.727323 | -0.304307 | -0.359191 | 0.599679 | 1.000000 |
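Each correlation above is just the corresponding covariance rescaled by the two standard deviations, r_xy = cov(x, y) / (s_x * s_y). A quick check with two hypothetical vectors:

```python
import pandas as pd

df = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0, 5.0],
                   'y': [2.0, 4.0, 5.0, 4.0, 6.0]})

# Correlation built by hand from covariance and standard deviations
manual = df['x'].cov(df['y']) / (df['x'].std() * df['y'].std())
# Correlation straight from pandas
built_in = df.corr().loc['x', 'y']

print(round(manual, 6), round(built_in, 6))  # the two values agree
```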
# Plot rectangular data as a color-encoded matrix.
# This is an Axes-level function and will draw the heatmap into the
# currently-active Axes if none is provided to the ``ax`` argument. Part of
# this Axes space will be taken and used to plot a colormap, unless ``cbar``
# is False or a separate Axes is provided to ``cbar_ax``.
plt.figure(figsize=(10,5))
corrmat = basketball.corr(method='pearson')
sns.heatmap(corrmat, cmap="YlGnBu", fmt='.2f',annot=True)
plt.show();
# In the following plot scatter diagrams are plotted for all the numerical columns in the dataset.
# A scatter plot is a visual representation of the degree of correlation between any two columns.
# The pair plot function in seaborn makes it very easy to generate joint scatter plots for all the
# columns in the data.
# Note: sns.pairplot creates and manages its own figure, so a prior plt.figure call is not needed
pairplot = sns.pairplot(basketball)
plt.show()
# Plots the scatter plot using two variables
sns.scatterplot(data=basketball, x="Tournament", y="Score", size="Score")
<AxesSubplot:xlabel='Tournament', ylabel='Score'>
# Another way of looking at multivariate scatter plot is to use the hue option
# in the scatterplot() function in seaborn.
# basketball['TeamLaunchCategory'] = (pd.qcut(basketball['TeamLaunch'], 4,
# labels=['Very Old', 'Old', 'New', 'Very New']));
sns.scatterplot(data=basketball, x="Tournament", y="Score", hue="TeamLaunchCategory")
<AxesSubplot:xlabel='Tournament', ylabel='Score'>
Important Insights:
SortedDf=basketball.sort_values('Tournament', ascending=False);
plt.figure(figsize=(40,10))
# set height of bar
bars1 = SortedDf.Tournament
bars2 = SortedDf.TournamentChampion
bars3 = SortedDf.RunnerUp
# set width of bar
barWidth = 0.3
# Set position of bar on X axis
r1 = np.arange(len(bars1)) + barWidth
r2 = [x + barWidth for x in r1]
r3 = [x + barWidth for x in r2]
# Make the plot
plt.bar(r1, bars1, color='b', width=barWidth, edgecolor='white', label='Tournament')
plt.bar(r2, bars2, color='g', width=barWidth, edgecolor='white', label='TournamentChampion')
plt.bar(r3, bars3, color='y', width=barWidth, edgecolor='white', label='RunnerUp')
# Add xticks on the middle of the group bars
plt.xlabel('Teams', fontweight='bold')
plt.xticks([r + 2*barWidth for r in range(len(bars1))], SortedDf.Team)
plt.title("Tournament vs TournamentChampion vs RunnerUp",size=35);
# Create legend & Show graphic
plt.legend()
plt.show()
SortedDf=basketball.sort_values('PlayedGames',ascending=False);
plt.figure(figsize=(40,10))
# set height of bar
bars1 = SortedDf.PlayedGames
bars2 = SortedDf.WonGames
bars3 = SortedDf.DrawnGames
bars4 = SortedDf.LostGames
# set width of bar
barWidth = 0.3
# Set position of bar on X axis
r1 = np.arange(len(bars1))+barWidth
r2 = [x + barWidth for x in r1]
r3 = [x + barWidth for x in r2]
r4 = [x + barWidth for x in r3]
# Make the plot
plt.bar(r1, bars1, color='b', width=barWidth, edgecolor='white', label='PlayedGames')
plt.bar(r2, bars2, color='g', width=barWidth, edgecolor='white', label='WonGames')
plt.bar(r3, bars3, color='y', width=barWidth, edgecolor='white', label='DrawnGames')
plt.bar(r4, bars4, color='m', width=barWidth, edgecolor='white', label='LostGames')
# Add xticks on the middle of the group bars
plt.xlabel('Teams', fontweight='bold')
plt.xticks([r + 2.5*barWidth for r in range(len(bars1))], SortedDf.Team)
plt.title("PlayedGames vs WonGames Vs DrawnGames vs LostGames",size=35)
# Create legend & Show graphic
plt.legend()
plt.show()
SortedDf=basketball.sort_values('BasketScored',ascending=False);
plt.figure(figsize=(40,10))
# set height of bar
bars1 = SortedDf.BasketScored
bars2 = SortedDf.BasketGiven
# set width of bar
barWidth = 0.3
# Set position of bar on X axis
r1 = np.arange(len(bars1))+barWidth
r2 = [x + barWidth for x in r1]
# Make the plot
plt.bar(r1, bars1, color='g', width=barWidth, edgecolor='white', label='BasketScored')
plt.bar(r2, bars2, color='r', width=barWidth, edgecolor='white', label='BasketGiven')
# Add xticks on the middle of the group bars
plt.xlabel('Teams', fontweight='bold')
plt.xticks([r + 1.5*barWidth for r in range(len(bars1))], SortedDf.Team)
plt.title("BasketScored vs BasketGiven",size=35)
# Create legend & Show graphic
plt.legend()
plt.show()
# No of matches won/lost by teams
plt.figure(figsize=(20,8))
plt.subplot(1,2,1)
bb = basketball[['Team','WonGames']].sort_values(by="WonGames", ascending=False)
plot = plt.pie(bb['WonGames'], labels=list(bb['Team'][:15]) + [str()] * (len(bb)-15), explode=[0.015*x for x in range(len(bb))])
plt.title('# matches won')
plt.subplot(1,2,2)
bb = basketball[['Team','LostGames']].sort_values(by="LostGames", ascending=False)
plot = plt.pie(bb['LostGames'], labels=list(bb['Team'][:20]) + [str()] * (len(bb)-20), explode=[0.015*x for x in range(len(bb))])
plt.title('# matches lost')
Text(0.5, 1.0, '# matches lost')
# No of baskets scored/given by teams
plt.figure(figsize=(20,8))
plt.subplot(1,2,1)
bb = basketball[['Team','BasketScored']].sort_values(by="BasketScored", ascending=False)
plot = plt.pie(bb['BasketScored'], labels=list(bb['Team'][:15]) + [str()] * (len(bb)-15))
plt.title('# Baskets scored')
plt.subplot(1,2,2)
bb = basketball[['Team','BasketGiven']].sort_values(by="BasketGiven", ascending=False)
plot = plt.pie(bb['BasketGiven'], labels=list(bb['Team'][:20]) + [str()] * (len(bb)-20))
plt.title('# Baskets given')
Text(0.5, 1.0, '# Baskets given')
# The most RunnerUp
plt.figure(figsize=(20,8))
bb = basketball[['Team','RunnerUp']].sort_values(by="RunnerUp", ascending=False).where(lambda x: x["RunnerUp"] > 0)
sns.barplot(x=bb['Team'], y=bb['RunnerUp']) # keyword arguments; newer seaborn no longer accepts these positionally
plt.xticks(rotation=90)
(array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]), [Text(0, 0, 'T2'), Text(1, 0, 'T1'), Text(2, 0, 'T3'), Text(3, 0, 'T5'), Text(4, 0, 'T4'), Text(5, 0, 'T11'), Text(6, 0, 'T6'), Text(7, 0, 'T8'), Text(8, 0, 'T9'), Text(9, 0, 'T21'), Text(10, 0, 'T20'), Text(11, 0, 'T14'), Text(12, 0, 'T15')])
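The `.where` call above keeps the frame's full shape and fills non-matching rows with NaN (seaborn then drops them when plotting), whereas plain boolean indexing removes those rows outright. A minimal sketch with a hypothetical frame:

```python
import pandas as pd

df = pd.DataFrame({'Team': ['T1', 'T2', 'T3'], 'RunnerUp': [23, 25, 0]})

kept_shape = df.where(df['RunnerUp'] > 0)   # same length; the T3 row becomes NaN
filtered = df[df['RunnerUp'] > 0]           # the T3 row is dropped entirely

print(len(kept_shape), len(filtered))  # 3 2
```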
Important Insights:
Through bivariate analysis we try to analyze two variables simultaneously. As opposed to univariate analysis where we check the characteristics of a single variable, in bivariate analysis we try to determine if there is any relationship between two variables.
There are essentially three major scenarios that we will come across when we perform bivariate analysis: the two variables move together (positive relationship), they move in opposite directions (negative relationship), or they show no relationship at all.
SortedDf=basketball.sort_values('PlayedGames',ascending=False);
fig, ax = plt.subplots(figsize=(40,10)) # setting the figure size of the plot
ax.scatter(SortedDf['Team'], SortedDf['PlayedGames']) # scatter plot
ax.set_xlabel('Team ', fontsize=20)
ax.set_ylabel('Played Games', fontsize=20)
team=np.array(SortedDf.Team);
basket=np.array(SortedDf.PlayedGames);
for i, txt in enumerate(SortedDf.TeamLaunch):
plt.annotate(txt, (team[i], basket[i]))
plt.title("Team vs PlayedGames annotated with TeamLaunch",size=35);
plt.show()
# basketball['TeamLaunchCategory'] = (pd.qcut(basketball['TeamLaunch'], 4,
# labels=['Very Old', 'Old', 'New', 'Very New']));
GamesPlayed=basketball.sort_values('PlayedGames',ascending=False).groupby('TeamLaunchCategory')['PlayedGames'].sum().reset_index();
plt.figure(figsize=(15,5))
sns.barplot(x ='TeamLaunchCategory', y='PlayedGames' ,data = GamesPlayed)
plt.title("PlayedGames across Team launch category",size=15)
plt.show()
plt.figure(figsize=(30,10))
ax = sns.barplot(x="Team", y="Tournament", data=basketball)
# Scores of the teams
plt.figure(figsize=(20,15))
ax = sns.barplot(x="Score", y="Team", data=basketball, orient='h')
plt.figure(figsize=(40,10))
sns.barplot(data=basketball,x='Team',y='Score',hue='TeamLaunchCategory')
plt.show()
#Plot correlation matrix using pandas, and select values in upper triangle
#https://stackoverflow.com/questions/29432629/plot-correlation-matrix-using-pandas
CorMatrix = basketball.corr().abs()
Upper = CorMatrix.where(np.triu(np.ones(CorMatrix.shape),k=1).astype(bool)) # np.bool was removed in NumPy 1.24; the builtin bool works in all versions
Upper
| | Tournament | Score | PlayedGames | WonGames | DrawnGames | LostGames | BasketScored | BasketGiven | TournamentChampion | RunnerUp | TeamLaunch | HighestPositionHeld |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Tournament | NaN | 0.981258 | 0.998683 | 0.971749 | 0.989295 | 0.949350 | 0.974867 | 0.987781 | 0.589304 | 0.645924 | 0.599166 | 0.708002 |
| Score | NaN | NaN | 0.979687 | 0.997232 | 0.958090 | 0.878780 | 0.996634 | 0.943801 | 0.715014 | 0.764819 | 0.538542 | 0.669215 |
| PlayedGames | NaN | NaN | NaN | 0.967486 | 0.993972 | 0.955925 | 0.969970 | 0.990020 | 0.575072 | 0.633101 | 0.581243 | 0.711065 |
| WonGames | NaN | NaN | NaN | NaN | 0.939416 | 0.851436 | 0.999312 | 0.925745 | 0.752904 | 0.797849 | 0.540421 | 0.647819 |
| DrawnGames | NaN | NaN | NaN | NaN | NaN | 0.972786 | 0.942234 | 0.992479 | 0.503378 | 0.566021 | 0.570175 | 0.720248 |
| LostGames | NaN | NaN | NaN | NaN | NaN | NaN | 0.857843 | 0.985041 | 0.324025 | 0.391645 | 0.591571 | 0.726172 |
| BasketScored | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.931079 | 0.743847 | 0.791041 | 0.550150 | 0.648672 |
| BasketGiven | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.471618 | 0.533448 | 0.608902 | 0.727323 |
| TournamentChampion | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.971533 | 0.286768 | 0.304307 |
| RunnerUp | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.317250 | 0.359191 |
| TeamLaunch | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.599679 |
| HighestPositionHeld | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
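A common use of this upper-triangle view is to flag one column from each highly correlated pair, e.g. as a candidate to drop before modeling. A small sketch with a hypothetical 3x3 correlation matrix (using the builtin `bool`, since `np.bool` was removed in NumPy 1.24):

```python
import numpy as np
import pandas as pd

corr = pd.DataFrame([[1.00, 0.95, 0.20],
                     [0.95, 1.00, 0.30],
                     [0.20, 0.30, 1.00]],
                    index=['a', 'b', 'c'], columns=['a', 'b', 'c'])

# Keep only cells strictly above the diagonal (k=1) to avoid
# double-counting each pair
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))

# Columns correlated above 0.9 with some earlier column
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
print(to_drop)  # ['b']
```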
PerformanceMatrix=basketball[['Team','TeamLaunch','TeamLaunchCategory','PlayedGames','HighestPositionHeld']].copy() # .copy() prevents SettingWithCopyWarning when the percentage columns are added below
PerformanceMatrix['Win']=round((basketball.WonGames/ basketball.PlayedGames)*100,2)
PerformanceMatrix['Drawn']=round((basketball.DrawnGames/ basketball.PlayedGames)*100,2)
PerformanceMatrix['Lost']=round((basketball.LostGames/ basketball.PlayedGames)*100,2)
PerformanceMatrix['TChampionship']=round((basketball.TournamentChampion/ basketball.Tournament)*100,2)
PerformanceMatrix['TRunnerUp']=round((basketball.RunnerUp/ basketball.Tournament)*100,2)
PerformanceMatrix.describe()
| | TeamLaunch | PlayedGames | HighestPositionHeld | Win | Drawn | Lost | TChampionship | TRunnerUp |
|---|---|---|---|---|---|---|---|---|
| count | 60.000000 | 60.000000 | 60.000000 | 60.000000 | 60.000000 | 60.000000 | 60.000000 | 60.000000 |
| mean | 1957.950000 | 810.100000 | 7.050000 | 31.364667 | 23.617167 | 44.992167 | 1.749500 | 1.956333 |
| std | 26.646732 | 877.465393 | 5.315232 | 7.831421 | 4.459405 | 8.401111 | 6.442478 | 5.491846 |
| min | 1929.000000 | 30.000000 | 1.000000 | 16.670000 | 11.110000 | 20.380000 | 0.000000 | 0.000000 |
| 25% | 1934.750000 | 115.500000 | 3.000000 | 27.610000 | 22.330000 | 41.147500 | 0.000000 | 0.000000 |
| 50% | 1950.500000 | 424.500000 | 6.000000 | 30.490000 | 24.370000 | 45.725000 | 0.000000 | 0.000000 |
| 75% | 1977.250000 | 1345.500000 | 10.000000 | 33.542500 | 26.212500 | 48.542500 | 0.000000 | 0.000000 |
| max | 2016.000000 | 2762.000000 | 20.000000 | 59.630000 | 38.600000 | 70.000000 | 38.370000 | 29.070000 |
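The Win/Drawn/Lost columns are simple row-wise percentages. As a sanity check, the figures for team T1 reproduce from the raw counts shown in the dataset (2762 played, 1647 won, 552 drawn, 563 lost):

```python
# Raw counts for T1 from the basketball data
played, won, drawn, lost = 2762, 1647, 552, 563

win_pct = round(won / played * 100, 2)
drawn_pct = round(drawn / played * 100, 2)
lost_pct = round(lost / played * 100, 2)

print(win_pct, drawn_pct, lost_pct)  # 59.63 19.99 20.38
```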
fig, ax = plt.subplots(1, 3)
fig.set_figheight(5)
fig.set_figwidth(15)
sns.distplot(PerformanceMatrix['Win'], ax = ax[0])
sns.distplot(PerformanceMatrix['Lost'], ax = ax[1])
sns.distplot(PerformanceMatrix['Drawn'], ax = ax[2])
ax[0].set_title('Win %')
ax[1].set_title('Lost %')
ax[2].set_title('Drawn %')
plt.show()
fig, ax = plt.subplots(2, 3)
fig.set_figheight(5)
fig.set_figwidth(15)
sns.boxplot(x=PerformanceMatrix['Win'],ax=ax[0][0]);
sns.boxplot(x=PerformanceMatrix['Lost'],ax=ax[0][1]);
sns.boxplot(x=PerformanceMatrix['Drawn'],ax=ax[0][2]);
sns.boxplot(x=PerformanceMatrix['TChampionship'],ax=ax[1][0]);
sns.boxplot(x=PerformanceMatrix['TRunnerUp'],ax=ax[1][1]);
sns.boxplot(x=PerformanceMatrix['TeamLaunch'],ax=ax[1][2]);
fig.tight_layout()
plt.show()
# Get the z-score of every value with respect to their columns
PmSelect = PerformanceMatrix[['Win','Drawn','Lost']];
z = np.abs(stats.zscore(PmSelect))
threshold = 3
np.where(z > threshold)
(array([ 0, 1, 45, 59], dtype=int64), array([0, 0, 1, 2], dtype=int64))
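The threshold-3 rule used above flags any value more than three standard deviations from its column mean. A minimal sketch on a hypothetical column:

```python
import numpy as np
from scipy import stats

# Hypothetical column: 20 ordinary values plus one extreme entry
values = np.array([28.0, 29.0, 30.0, 31.0, 32.0] * 4 + [80.0])

z = np.abs(stats.zscore(values))   # (x - mean) / std, population std by default
outlier_idx = np.where(z > 3)[0]
print(outlier_idx)  # [20]
```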
PerformanceMatrix.iloc[[0,1,45,59]]
| | Team | TeamLaunch | TeamLaunchCategory | PlayedGames | HighestPositionHeld | Win | Drawn | Lost | TChampionship | TRunnerUp |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | T1 | 1929 | Very Old | 2762 | 1 | 59.63 | 19.99 | 20.38 | 38.37 | 26.74 |
| 1 | T2 | 1929 | Very Old | 2762 | 1 | 57.24 | 20.75 | 22.01 | 29.07 | 29.07 |
| 45 | T46 | 1990 | Very New | 114 | 9 | 22.81 | 38.60 | 38.60 | 0.00 | 0.00 |
| 59 | T60 | 1955 | New | 30 | 15 | 16.67 | 13.33 | 70.00 | 0.00 | 0.00 |
Important Insights:
SortedDf=PerformanceMatrix.sort_values('Win',ascending=False);
plt.figure(figsize=(35,10));
# set height of bar
bars1 = SortedDf.Win
bars2 = SortedDf.Drawn
bars3 = SortedDf.Lost
# set width of bar
barWidth = 0.2
# Set position of bar on X axis
r1 = np.arange(len(bars1))+barWidth
r2 = [x + barWidth for x in r1]
r3 = [x + barWidth for x in r2]
# Make the plot
plt.bar(r1, bars1, color='g', width=barWidth, edgecolor='white', label='Win%')
plt.bar(r2, bars2, color='y', width=barWidth, edgecolor='white', label='Drawn%')
plt.bar(r3, bars3, color='r', width=barWidth, edgecolor='white', label='Lost%')
# Add xticks on the middle of the group bars
plt.xlabel('Teams', fontweight='bold')
plt.xticks([r + 2*barWidth for r in range(len(bars1))], SortedDf.Team)
plt.title("Win% Vs Drawn% Vs Lost%",size=35)
# Create legend & Show graphic
plt.legend()
plt.show()
SortedDf=(PerformanceMatrix.sort_values(['TeamLaunch'],ascending=[True]))
plt.figure(figsize=(30,5))
sns.pointplot(x='Team',y='PlayedGames',hue='TeamLaunchCategory' ,data=SortedDf);
plt.title("Played games across teams in the order of team launch ")
plt.show()
SortedDf=(PerformanceMatrix.sort_values(['TeamLaunch'],ascending=[True]))
fig, ax =plt.subplots(6,figsize=(30, 30), sharey=True)
sns.pointplot(x='Team',y='Win',hue='TeamLaunchCategory' ,data=SortedDf,ax=ax[0]);
sns.pointplot(x='Team',y='Drawn',hue='TeamLaunchCategory' ,data=SortedDf,ax=ax[1]);
sns.pointplot(x='Team',y='Lost',hue='TeamLaunchCategory' ,data=SortedDf,ax=ax[2]);
sns.pointplot(x='Team',y='TChampionship',hue='TeamLaunchCategory' ,data=SortedDf,ax=ax[3]);
sns.pointplot(x='Team',y='TRunnerUp',hue='TeamLaunchCategory' ,data=SortedDf,ax=ax[4]);
sns.pointplot(x='Team',y='HighestPositionHeld',hue='TeamLaunchCategory' ,data=SortedDf,ax=ax[5]);
fig.tight_layout()
plt.subplots_adjust(hspace=0.3)
ax[0].set_title("Performance report of teams in playing games\n Win%",size='30')
ax[1].set_title("Drawn %",size='30')
ax[2].set_title("Lost %",size='30')
ax[3].set_title("Tournament Championship %",size='30')
ax[4].set_title("Tournament RunnerUp %",size='30')
ax[5].set_title("HighestPositionHeld",size='30')
Text(0.5, 1.0, 'HighestPositionHeld')
GroupLostMean=PerformanceMatrix.groupby(['TeamLaunchCategory'])['Lost'].mean().reset_index();
GroupDrawnMean=PerformanceMatrix.groupby(['TeamLaunchCategory'])['Drawn'].mean().reset_index();
GroupWinMean=PerformanceMatrix.groupby(['TeamLaunchCategory'])['Win'].mean().reset_index();
fig, ax = plt.subplots(3, figsize=(10, 10), sharey=True)
sns.pointplot(x=GroupWinMean.TeamLaunchCategory,y=GroupWinMean.Win,data=GroupWinMean,ax=ax[0]);
sns.pointplot(x=GroupDrawnMean.TeamLaunchCategory,y=GroupDrawnMean.Drawn,data=GroupDrawnMean,ax=ax[1]);
sns.pointplot(x=GroupLostMean.TeamLaunchCategory,y=GroupLostMean.Lost,data=GroupLostMean,ax=ax[2]);
fig.tight_layout()
plt.subplots_adjust(hspace=0.4)
ax[0].set_title("Analyze Team launch categories \n Mean Win%",size='20')
ax[1].set_title("Mean Drawn %",size='20')
ax[2].set_title("Mean Lost %",size='20')
plt.show()
Important Insights:
PerformanceMatrix.sort_values(['Win'],ascending=[False]).head(10)
| | Team | TeamLaunch | TeamLaunchCategory | PlayedGames | HighestPositionHeld | Win | Drawn | Lost | TChampionship | TRunnerUp |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | T1 | 1929 | Very Old | 2762 | 1 | 59.63 | 19.99 | 20.38 | 38.37 | 26.74 |
| 1 | T2 | 1929 | Very Old | 2762 | 1 | 57.24 | 20.75 | 22.01 | 29.07 | 29.07 |
| 2 | T3 | 1929 | Very Old | 2614 | 1 | 47.48 | 22.88 | 29.65 | 12.50 | 10.00 |
| 3 | T4 | 1931 | Very Old | 2664 | 1 | 44.56 | 23.12 | 32.32 | 7.32 | 7.32 |
| 4 | T5 | 1929 | Very Old | 2762 | 1 | 43.77 | 22.92 | 33.31 | 9.30 | 8.14 |
| 20 | T21 | 1998 | Very New | 646 | 2 | 41.18 | 26.63 | 32.20 | 0.00 | 5.88 |
| 5 | T6 | 1934 | Very Old | 2408 | 1 | 41.11 | 22.05 | 36.84 | 1.37 | 5.48 |
| 7 | T8 | 1929 | Very Old | 2302 | 1 | 37.53 | 25.07 | 37.40 | 2.86 | 4.29 |
| 10 | T11 | 1941 | Old | 1530 | 1 | 36.80 | 25.62 | 37.58 | 2.22 | 11.11 |
| 6 | T7 | 1929 | Very Old | 2626 | 3 | 36.10 | 23.15 | 40.75 | 0.00 | 0.00 |
PerformanceMatrix[PerformanceMatrix.TeamLaunchCategory != 'Very Old'].sort_values(['Win'],ascending=[False]).head(10)
| | Team | TeamLaunch | TeamLaunchCategory | PlayedGames | HighestPositionHeld | Win | Drawn | Lost | TChampionship | TRunnerUp |
|---|---|---|---|---|---|---|---|---|---|---|
| 20 | T21 | 1998 | Very New | 646 | 2 | 41.18 | 26.63 | 32.20 | 0.00 | 5.88 |
| 10 | T11 | 1941 | Old | 1530 | 1 | 36.80 | 25.62 | 37.58 | 2.22 | 11.11 |
| 8 | T9 | 1939 | Old | 1986 | 2 | 35.15 | 26.28 | 38.57 | 0.00 | 1.72 |
| 11 | T12 | 1939 | Old | 1698 | 4 | 34.51 | 22.91 | 42.58 | 0.00 | 0.00 |
| 18 | T19 | 1960 | New | 988 | 3 | 33.70 | 25.91 | 40.38 | 0.00 | 0.00 |
| 19 | T20 | 1951 | New | 1096 | 2 | 33.49 | 22.08 | 44.43 | 0.00 | 3.03 |
| 38 | T39 | 1994 | Very New | 160 | 10 | 32.50 | 28.12 | 39.38 | 0.00 | 0.00 |
| 15 | T16 | 1935 | Old | 1318 | 4 | 32.32 | 24.81 | 42.87 | 0.00 | 0.00 |
| 14 | T15 | 1944 | Old | 1458 | 2 | 32.30 | 24.55 | 43.14 | 0.00 | 2.33 |
| 24 | T25 | 2004 | Very New | 456 | 6 | 32.24 | 24.56 | 43.20 | 0.00 | 0.00 |
basketball[(basketball.PlayedGames==basketball.PlayedGames.max())]
| | Team | Tournament | Score | PlayedGames | WonGames | DrawnGames | LostGames | BasketScored | BasketGiven | TournamentChampion | RunnerUp | TeamLaunch | HighestPositionHeld | TeamLaunchCategory |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | T1 | 86 | 4385 | 2762 | 1647 | 552 | 563 | 5947 | 3140 | 33 | 23 | 1929 | 1 | Very Old |
| 1 | T2 | 86 | 4262 | 2762 | 1581 | 573 | 608 | 5900 | 3114 | 25 | 25 | 1929 | 1 | Very Old |
| 4 | T5 | 86 | 3368 | 2762 | 1209 | 633 | 920 | 4631 | 3700 | 8 | 7 | 1929 | 1 | Very Old |
PerformanceMatrix[(PerformanceMatrix.Win==PerformanceMatrix.Win.max()) & (PerformanceMatrix.Lost==PerformanceMatrix.Lost.min())]
| | Team | TeamLaunch | TeamLaunchCategory | PlayedGames | HighestPositionHeld | Win | Drawn | Lost | TChampionship | TRunnerUp |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | T1 | 1929 | Very Old | 2762 | 1 | 59.63 | 19.99 | 20.38 | 38.37 | 26.74 |
# Performance order of the top teams is T1 > T2 > T5
# These teams are outliers, so we also look at the next-best teams
basketball[(basketball.PlayedGames==basketball.PlayedGames.nlargest(4).iloc[-1])] # .iloc[-1] selects the 4th-largest value by position rather than by index label
| | Team | Tournament | Score | PlayedGames | WonGames | DrawnGames | LostGames | BasketScored | BasketGiven | TournamentChampion | RunnerUp | TeamLaunch | HighestPositionHeld | TeamLaunchCategory |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | T4 | 82 | 3386 | 2664 | 1187 | 616 | 861 | 4398 | 3469 | 6 | 6 | 1931 | 1 | Very Old |
(basketball.sort_values(['PlayedGames'],ascending=[True])).head(3)
| | Team | Tournament | Score | PlayedGames | WonGames | DrawnGames | LostGames | BasketScored | BasketGiven | TournamentChampion | RunnerUp | TeamLaunch | HighestPositionHeld | TeamLaunchCategory |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 59 | T60 | 1 | 14 | 30 | 5 | 4 | 21 | 34 | 65 | 0 | 0 | 1955 | 15 | New |
| 57 | T58 | 1 | 22 | 30 | 7 | 8 | 15 | 37 | 57 | 0 | 0 | 1956 | 16 | New |
| 58 | T59 | 1 | 19 | 30 | 7 | 5 | 18 | 51 | 85 | 0 | 0 | 1951 | 16 | New |
(basketball.sort_values(['HighestPositionHeld'],ascending=[False])).head(1)
| | Team | Tournament | Score | PlayedGames | WonGames | DrawnGames | LostGames | BasketScored | BasketGiven | TournamentChampion | RunnerUp | TeamLaunch | HighestPositionHeld | TeamLaunchCategory |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 56 | T57 | 1 | 34 | 38 | 8 | 10 | 20 | 38 | 66 | 0 | 0 | 2009 | 20 | Very New |
SortedDf=PerformanceMatrix.loc[PerformanceMatrix['TeamLaunchCategory']=='Very Old'].sort_values(['Win'],ascending=[True]) # renamed to avoid shadowing the builtin sorted()
SortedDf.head(5)
| | Team | TeamLaunch | TeamLaunchCategory | PlayedGames | HighestPositionHeld | Win | Drawn | Lost | TChampionship | TRunnerUp |
|---|---|---|---|---|---|---|---|---|---|---|
| 51 | T52 | 1929 | Very Old | 72 | 6 | 29.17 | 19.44 | 51.39 | 0.0 | 0.00 |
| 13 | T14 | 1929 | Very Old | 1428 | 2 | 31.72 | 23.53 | 44.75 | 0.0 | 2.27 |
| 28 | T29 | 1930 | Very Old | 380 | 6 | 32.89 | 21.32 | 45.79 | 0.0 | 0.00 |
| 44 | T45 | 1929 | Very Old | 130 | 3 | 33.08 | 16.15 | 50.77 | 0.0 | 0.00 |
| 53 | T54 | 1929 | Very Old | 54 | 8 | 33.33 | 11.11 | 55.56 | 0.0 | 0.00 |
PerformanceMatrix.sort_values(['Drawn'],ascending=[False]).head(1)
| | Team | TeamLaunch | TeamLaunchCategory | PlayedGames | HighestPositionHeld | Win | Drawn | Lost | TChampionship | TRunnerUp |
|---|---|---|---|---|---|---|---|---|---|---|
| 45 | T46 | 1990 | Very New | 114 | 9 | 22.81 | 38.6 | 38.6 | 0.0 | 0.0 |
SortedDf=(basketball.sort_values(['PlayedGames'],ascending=[True])) # renamed to avoid shadowing the builtin sorted()
SortedDf[(SortedDf.TeamLaunch==SortedDf.TeamLaunch.min())].head(2)
| | Team | Tournament | Score | PlayedGames | WonGames | DrawnGames | LostGames | BasketScored | BasketGiven | TournamentChampion | RunnerUp | TeamLaunch | HighestPositionHeld | TeamLaunchCategory |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 53 | T54 | 3 | 42 | 54 | 18 | 6 | 30 | 97 | 131 | 0 | 0 | 1929 | 8 | Very Old |
| 51 | T52 | 4 | 56 | 72 | 21 | 14 | 37 | 153 | 184 | 0 | 0 | 1929 | 6 | Very Old |
TeamsFilters=['T21','T39','T46','T19','T20','T11','T9','T5','T3','T4','T6','T14','T7','T8','T10','T18']
BetterTeams=pd.DataFrame(columns=PerformanceMatrix.columns);
for i in TeamsFilters:
if i in PerformanceMatrix.Team.values:
BetterTeams = pd.concat([BetterTeams,PerformanceMatrix[PerformanceMatrix.Team==i]], ignore_index=True)
BetterTeams.sort_values('Win',ascending=False)
| | Team | TeamLaunch | TeamLaunchCategory | PlayedGames | HighestPositionHeld | Win | Drawn | Lost | TChampionship | TRunnerUp |
|---|---|---|---|---|---|---|---|---|---|---|
| 8 | T3 | 1929 | Very Old | 2614 | 1 | 47.48 | 22.88 | 29.65 | 12.50 | 10.00 |
| 9 | T4 | 1931 | Very Old | 2664 | 1 | 44.56 | 23.12 | 32.32 | 7.32 | 7.32 |
| 7 | T5 | 1929 | Very Old | 2762 | 1 | 43.77 | 22.92 | 33.31 | 9.30 | 8.14 |
| 0 | T21 | 1998 | Very New | 646 | 2 | 41.18 | 26.63 | 32.20 | 0.00 | 5.88 |
| 10 | T6 | 1934 | Very Old | 2408 | 1 | 41.11 | 22.05 | 36.84 | 1.37 | 5.48 |
| 13 | T8 | 1929 | Very Old | 2302 | 1 | 37.53 | 25.07 | 37.40 | 2.86 | 4.29 |
| 5 | T11 | 1941 | Old | 1530 | 1 | 36.80 | 25.62 | 37.58 | 2.22 | 11.11 |
| 12 | T7 | 1929 | Very Old | 2626 | 3 | 36.10 | 23.15 | 40.75 | 0.00 | 0.00 |
| 6 | T9 | 1939 | Old | 1986 | 2 | 35.15 | 26.28 | 38.57 | 0.00 | 1.72 |
| 14 | T10 | 1932 | Very Old | 1728 | 1 | 35.07 | 25.46 | 39.47 | 1.96 | 0.00 |
| 15 | T18 | 1933 | Very Old | 1192 | 3 | 34.23 | 24.50 | 41.28 | 0.00 | 0.00 |
| 3 | T19 | 1960 | New | 988 | 3 | 33.70 | 25.91 | 40.38 | 0.00 | 0.00 |
| 4 | T20 | 1951 | New | 1096 | 2 | 33.49 | 22.08 | 44.43 | 0.00 | 3.03 |
| 1 | T39 | 1994 | Very New | 160 | 10 | 32.50 | 28.12 | 39.38 | 0.00 | 0.00 |
| 11 | T14 | 1929 | Very Old | 1428 | 2 | 31.72 | 23.53 | 44.75 | 0.00 | 2.27 |
| 2 | T46 | 1990 | Very New | 114 | 9 | 22.81 | 38.60 | 38.60 | 0.00 | 0.00 |
BetterTeams[BetterTeams.TeamLaunchCategory!='Very Old']
| | Team | TeamLaunch | TeamLaunchCategory | PlayedGames | HighestPositionHeld | Win | Drawn | Lost | TChampionship | TRunnerUp |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | T21 | 1998 | Very New | 646 | 2 | 41.18 | 26.63 | 32.20 | 0.00 | 5.88 |
| 1 | T39 | 1994 | Very New | 160 | 10 | 32.50 | 28.12 | 39.38 | 0.00 | 0.00 |
| 2 | T46 | 1990 | Very New | 114 | 9 | 22.81 | 38.60 | 38.60 | 0.00 | 0.00 |
| 3 | T19 | 1960 | New | 988 | 3 | 33.70 | 25.91 | 40.38 | 0.00 | 0.00 |
| 4 | T20 | 1951 | New | 1096 | 2 | 33.49 | 22.08 | 44.43 | 0.00 | 3.03 |
| 5 | T11 | 1941 | Old | 1530 | 1 | 36.80 | 25.62 | 37.58 | 2.22 | 11.11 |
| 6 | T9 | 1939 | Old | 1986 | 2 | 35.15 | 26.28 | 38.57 | 0.00 | 1.72 |
Please find below suggestions for data point collection and other relevant guidelines, organised around the five Vs of Big Data:
Volume: We can add more teams for a better understanding of the data. To increase the predictive power of the dataset, we can add other information such as player details, demographics, tournament location, etc. Database systems can then move from traditional to more advanced big data systems.
Velocity: If we consider the time value of the data, it seems outdated and of little use, particularly if the Big Data project is to serve any real-time or near-real-time business needs. In that context we should redefine data quality metrics so that they are relevant as well as feasible in a real-time setting.
Variety: For better insights and for AI/ML modeling projects, we can add structured, semi-structured, and unstructured data coming in from different sources relevant to basketball.
Veracity: We have incomplete team information. For example in Team 61: This Team don't have any information about Score, PlayedGames Etc.. It has HighestPoistionHeld as 1. Accuracy of data collection should be improved. Besides data inaccuracies, Veracity also includes data consistency (defined by the statistical reliability of data) and data trustworthiness (based on data origin, data collection and processing methods, security infrastructure, etc.). These data quality issues in turn impact data integrity and data accountability.
Value: The Value characteristic connects directly to the end purpose and the business use cases. We can harness the power of Big Data for many diverse business pursuits, and those pursuits are the real drivers of how data quality is defined, measured, and improved. Data Science is already playing a pivotal role in sports analytics.
Based on a strong understanding of the business use cases and the Big Data architecture, we can design and implement an optimal layer of data governance strategy to further improve the data quality with data definitions, metadata requirements, data ownership, data flow diagrams, etc.
We can add more attributes to the dataset. More relevant attributes, such as Canceled Games, Basket Ratio, Winning Ratio, and Win/Loss Percentage, will help us analyze teams more accurately.
Even simple ML models could add value beyond the present EDA use case.
Big data and data science together allow us to see both the forest and the trees (micro and macro perspectives).
Visualization, dashboarding, and interactivity make the data more useful to the general public. We can expose the results through an API deployed on the cloud to serve this purpose.
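As a sketch of the derived attributes suggested above, percentages like those in the BetterTeams table can be recomputed from raw counts. The win/loss counts below are assumed purely for illustration; only the column names mirror the dataset.

```python
import pandas as pd

# Illustrative sketch only: Win/Lost counts are assumed, not taken from
# the dataset; the column names mirror the BetterTeams table.
teams = pd.DataFrame({
    "Team": ["T21", "T39"],
    "PlayedGames": [646, 160],
    "Win": [266, 52],    # assumed win counts
    "Lost": [208, 63],   # assumed loss counts
})
teams["WinPct"] = 100 * teams["Win"] / teams["PlayedGames"]  # Winning Ratio (%)
teams["WinLossRatio"] = teams["Win"] / teams["Lost"]         # Win/Loss ratio
```

Such derived columns make cross-team comparisons independent of how many games each team has played.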
DOMAIN: Startup ecosystem
CONTEXT: Company X is an EU online publisher focusing on the startup industry. The company reports on the technology business: news, analysis of emerging trends, and profiles of new tech companies and products. Their event, Startup Battlefield, is the world's pre-eminent startup competition. Startup Battlefield features 15-30 top early stage startups pitching to top judges in front of a vast live audience, present in person and online.
DATA DESCRIPTION: CompanyX_EU.csv - Each row in the dataset is a Start-up company and the columns describe the company.
DATA DICTIONARY:
*Dataset has been downloaded from the internet. All the credit for the dataset goes to the original creator of the data.
PROJECT OBJECTIVE: Analyse the data of the various companies in the given dataset and perform the tasks specified in the steps below. Draw insights from the attributes present in the dataset, plot distributions, state hypotheses, and draw conclusions.
compx = pd.read_csv("Compx.csv")
compx.head()
| Startup | Product | Funding | Event | Result | OperatingState | |
|---|---|---|---|---|---|---|
| 0 | 2600Hz | 2600hz.com | NaN | Disrupt SF 2013 | Contestant | Operating |
| 1 | 3DLT | 3dlt.com | $630K | Disrupt NYC 2013 | Contestant | Closed |
| 2 | 3DPrinterOS | 3dprinteros.com | NaN | Disrupt SF 2016 | Contestant | Operating |
| 3 | 3Dprintler | 3dprintler.com | $1M | Disrupt NY 2016 | Audience choice | Operating |
| 4 | 42 Technologies | 42technologies.com | NaN | Disrupt NYC 2013 | Contestant | Operating |
compx.tail()
| Startup | Product | Funding | Event | Result | OperatingState | |
|---|---|---|---|---|---|---|
| 657 | Zivity | zivity.com | $8M | TC40 2007 | Contestant | Operating |
| 658 | Zmorph | zmorph3d.com | $1M | - | Audience choice | Operating |
| 659 | Zocdoc | zocdoc.com | $223M | TC40 2007 | Contestant | Operating |
| 660 | Zula | zulaapp.com | $3.4M | Disrupt SF 2013 | Audience choice | Operating |
| 661 | Zumper | zumper.com | $31.5M | Disrupt SF 2012 | Finalist | Operating |
compx.shape
(662, 6)
compx.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 662 entries, 0 to 661
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   Startup         662 non-null    object
 1   Product         656 non-null    object
 2   Funding         448 non-null    object
 3   Event           662 non-null    object
 4   Result          662 non-null    object
 5   OperatingState  662 non-null    object
dtypes: object(6)
memory usage: 31.2+ KB
def missing_check(df):
total = df.isnull().sum().sort_values(ascending=False) # total number of null values
percent = (df.isnull().sum()/df.isnull().count()).sort_values(ascending=False) # percentage of values that are null
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent']) # putting the above two together
return missing_data # return the dataframe
missing_check(compx)
| Total | Percent | |
|---|---|---|
| Funding | 214 | 0.323263 |
| Product | 6 | 0.009063 |
| Startup | 0 | 0.000000 |
| Event | 0 | 0.000000 |
| Result | 0 | 0.000000 |
| OperatingState | 0 | 0.000000 |
compx1 = compx.dropna(subset=['Funding', 'Product'])
compx1
| Startup | Product | Funding | Event | Result | OperatingState | |
|---|---|---|---|---|---|---|
| 1 | 3DLT | 3dlt.com | $630K | Disrupt NYC 2013 | Contestant | Closed |
| 3 | 3Dprintler | 3dprintler.com | $1M | Disrupt NY 2016 | Audience choice | Operating |
| 5 | 5to1 | 5to1.com | $19.3M | TC50 2009 | Contestant | Acquired |
| 6 | 8 Securities | 8securities.com | $29M | Disrupt Beijing 2011 | Finalist | Operating |
| 10 | AdhereTech | adheretech.com | $1.8M | Hardware Battlefield 2014 | Contestant | Operating |
| ... | ... | ... | ... | ... | ... | ... |
| 657 | Zivity | zivity.com | $8M | TC40 2007 | Contestant | Operating |
| 658 | Zmorph | zmorph3d.com | $1M | - | Audience choice | Operating |
| 659 | Zocdoc | zocdoc.com | $223M | TC40 2007 | Contestant | Operating |
| 660 | Zula | zulaapp.com | $3.4M | Disrupt SF 2013 | Audience choice | Operating |
| 661 | Zumper | zumper.com | $31.5M | Disrupt SF 2012 | Finalist | Operating |
446 rows × 6 columns
compx1.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 446 entries, 1 to 661
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   Startup         446 non-null    object
 1   Product         446 non-null    object
 2   Funding         446 non-null    object
 3   Event           446 non-null    object
 4   Result          446 non-null    object
 5   OperatingState  446 non-null    object
dtypes: object(6)
memory usage: 24.4+ KB
# Convert funding strings (e.g. "$630K", "$1M", "$1.2B") into millions of dollars
compx1.loc[:,'Funds_in_million'] = compx1['Funding'].apply(lambda x: float(x[1:-1])/1000 if x[-1] == 'K'
                                                           else (float(x[1:-1])*1000 if x[-1] == 'B'
                                                           else float(x[1:-1])))
compx1
| Startup | Product | Funding | Event | Result | OperatingState | Funds_in_million | |
|---|---|---|---|---|---|---|---|
| 1 | 3DLT | 3dlt.com | $630K | Disrupt NYC 2013 | Contestant | Closed | 0.63 |
| 3 | 3Dprintler | 3dprintler.com | $1M | Disrupt NY 2016 | Audience choice | Operating | 1.00 |
| 5 | 5to1 | 5to1.com | $19.3M | TC50 2009 | Contestant | Acquired | 19.30 |
| 6 | 8 Securities | 8securities.com | $29M | Disrupt Beijing 2011 | Finalist | Operating | 29.00 |
| 10 | AdhereTech | adheretech.com | $1.8M | Hardware Battlefield 2014 | Contestant | Operating | 1.80 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 657 | Zivity | zivity.com | $8M | TC40 2007 | Contestant | Operating | 8.00 |
| 658 | Zmorph | zmorph3d.com | $1M | - | Audience choice | Operating | 1.00 |
| 659 | Zocdoc | zocdoc.com | $223M | TC40 2007 | Contestant | Operating | 223.00 |
| 660 | Zula | zulaapp.com | $3.4M | Disrupt SF 2013 | Audience choice | Operating | 3.40 |
| 661 | Zumper | zumper.com | $31.5M | Disrupt SF 2012 | Finalist | Operating | 31.50 |
446 rows × 7 columns
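The conversion lambda assumes every funding string ends in K, M, or B after the dollar sign. A slightly more defensive sketch of the same idea (the helper name `to_millions` is illustrative, not from the original analysis) makes the unit handling explicit:

```python
# A more defensive variant of the funding conversion. Assumes amounts look
# like "$630K", "$1M", "$1.2B"; the helper name is illustrative.
def to_millions(funding):
    multipliers = {"K": 1e-3, "M": 1.0, "B": 1e3}
    s = funding.strip().lstrip("$")
    if s and s[-1].upper() in multipliers:
        return float(s[:-1]) * multipliers[s[-1].upper()]
    return float(s)  # no suffix: assume the value is already in millions
```

Applied with `compx1['Funding'].apply(to_millions)`, this should reproduce the `Funds_in_million` column for the suffix formats present in the data.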
plt.figure(figsize=(20,5))
ax = sns.boxplot(x=compx1["Funds_in_million"])
# Adding swarmplot to boxplot for better visualization
plt.figure(figsize=(20,8))
ax = sns.boxplot(x=compx1["Funds_in_million"], whis=np.inf)
ax = sns.swarmplot(x=compx1["Funds_in_million"], color=".2")
# We can also use jitter for better visualization
# ax = sns.stripplot(x=compx1["Funds_in_million"], color=".2", jitter=0.3)
# Calculate the IQR
cols = ['Funds_in_million']
Q1 = compx1[cols].quantile(0.25)
Q3 = compx1[cols].quantile(0.75)
IQR = Q3 - Q1
print(IQR)
Funds_in_million    8.72975
dtype: float64
Outliers = np.where(compx1[cols] > (Q3 + 1.5 * IQR))
Outliers
(array([ 3, 16, 21, 27, 33, 67, 71, 82, 84, 85, 87, 88, 102,
111, 117, 121, 122, 127, 129, 140, 144, 146, 153, 157, 183, 187,
189, 192, 205, 208, 212, 221, 230, 260, 263, 264, 267, 286, 291,
294, 311, 318, 365, 371, 375, 389, 395, 400, 403, 409, 418, 432,
433, 436, 437, 438, 439, 440, 443, 445], dtype=int64),
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int64))
# Rows without outliers
NonOutliers = compx1[cols][~((compx1[cols] > (Q3 + 1.5 * IQR))).any(axis=1)]
NonOutliers
| Funds_in_million | |
|---|---|
| 1 | 0.63 |
| 3 | 1.00 |
| 5 | 19.30 |
| 10 | 1.80 |
| 11 | 1.00 |
| ... | ... |
| 645 | 10.00 |
| 646 | 0.90 |
| 657 | 8.00 |
| 658 | 1.00 |
| 660 | 3.40 |
386 rows × 1 columns
NonOutliers.shape, compx1.shape
((386, 1), (446, 7))
NumberOfOutliers = compx1.shape[0] - NonOutliers.shape[0]
NumberOfOutliers
60
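The IQR fence logic above can be wrapped into a reusable helper (the name `iqr_outlier_mask` is illustrative). On a small synthetic series shaped like the funding data, only the extreme value is flagged:

```python
import pandas as pd

# Sketch of the Tukey fence rule used above; k=1.5 is the conventional
# multiplier. Values outside [Q1 - k*IQR, Q3 + k*IQR] are flagged.
def iqr_outlier_mask(series, k=1.5):
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

funds = pd.Series([0.63, 1.0, 19.3, 1.8, 1.0, 223.0])
mask = iqr_outlier_mask(funds)  # flags only the 223.0 entry
```

Filtering with `funds[~mask]` then mirrors the `NonOutliers` construction above.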
compx1["OperatingState"].value_counts()
Operating    319
Acquired      66
Closed        57
Ipo            4
Name: OperatingState, dtype: int64
plt.figure(figsize=(10,5))
ax = sns.countplot(x="OperatingState", data=compx1)
compx1.shape[0]
446
plt.figure(figsize=(10,5))
sns.barplot(x ='OperatingState', y='Funds_in_million' ,data = compx1)
plt.title("Operating State vs Funds in Million",size=15)
plt.show()
#Understand the distribution of 'Funds_in_million' variable
plt.figure(figsize=(10,5))
ax = sns.histplot(compx1['Funds_in_million'], log_scale=True, kde=True)
# Companies Operating vs Companies Closed
CompOperating = compx1[['Funds_in_million','OperatingState']][compx1['OperatingState']=='Operating']
CompClosed = compx1[['Funds_in_million','OperatingState']][compx1['OperatingState']=='Closed']
CompOperating.head()
| Funds_in_million | OperatingState | |
|---|---|---|
| 3 | 1.00 | Operating |
| 6 | 29.00 | Operating |
| 10 | 1.80 | Operating |
| 13 | 1.00 | Operating |
| 14 | 0.75 | Operating |
CompOperating.describe()
| Funds_in_million | |
|---|---|
| count | 319.000000 |
| mean | 19.062186 |
| std | 104.677241 |
| min | 0.005000 |
| 25% | 0.745500 |
| 50% | 2.200000 |
| 75% | 8.250000 |
| max | 1700.000000 |
plt.figure(figsize=(10,5))
ax = sns.histplot(CompOperating['Funds_in_million'], log_scale=True, kde=True)
CompClosed.head()
| Funds_in_million | OperatingState | |
|---|---|---|
| 1 | 0.6300 | Closed |
| 11 | 1.0000 | Closed |
| 17 | 0.9292 | Closed |
| 20 | 1.9000 | Closed |
| 34 | 0.6000 | Closed |
CompClosed.describe()
| Funds_in_million | |
|---|---|
| count | 57.000000 |
| mean | 3.258170 |
| std | 5.923294 |
| min | 0.093000 |
| 25% | 0.475000 |
| 50% | 0.929200 |
| 75% | 3.500000 |
| max | 35.500000 |
plt.figure(figsize=(10,5))
ax = sns.histplot(CompClosed['Funds_in_million'], log_scale=True, kde=True)
Important Insights:
From the graphical and descriptive statistics, it appears that there is a significant difference between the funds raised by companies that are still operating and those that have closed.
This will be validated formally using hypothesis testing.
The two hypotheses for this particular two sample t-test are as follows:
H0: The mean funds raised by operating companies and by closed companies are equal.
Ha: The mean funds raised by operating companies and by closed companies are not equal.
Assumptions:
Observations in two groups have an approximately normal distribution (Shapiro-Wilks Test)
Homogeneity of variances (variances are equal between groups), checked with the Levene or Bartlett test; this assumption does not hold in our case.
The two groups are sampled independently from each other from the same population
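The assumption checks listed above can be sketched with `scipy.stats`; the synthetic groups below stand in for the Operating and Closed funding columns:

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for the two funding groups (illustrative only).
rng = np.random.default_rng(42)
g1 = rng.normal(10, 2, size=50)
g2 = rng.normal(12, 5, size=50)

sw_stat, sw_p = stats.shapiro(g1)        # H0: the sample is normally distributed
lev_stat, lev_p = stats.levene(g1, g2)   # H0: the groups have equal variances
```

A small p-value in either test suggests the corresponding assumption is violated.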
Note: The two sample t-test is relatively robust to violations of normality and homogeneity of variances when the sample size is large (n ≥ 30) and the groups are of equal size (n1 = n2).
If the sample size is small and the data do not follow a normal distribution, we should use the non-parametric Mann-Whitney U test (Wilcoxon rank-sum test).
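A minimal sketch of the Mann-Whitney U test mentioned above, using synthetic right-skewed samples similar in shape to the funding amounts:

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed samples (illustrative only).
rng = np.random.default_rng(0)
a = rng.exponential(scale=10, size=40)
b = rng.exponential(scale=2, size=40)

# Rank-based test: no normality assumption on either group.
u_stat, p_val = stats.mannwhitneyu(a, b, alternative="two-sided")
```

Because it compares ranks rather than means, the test is unaffected by the heavy right tail that dominates the funding distribution.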
Conduct a two sample t-test:
Next, we’ll use the ttest_ind() function from the scipy.stats library to conduct a two sample t-test, which uses the following syntax:
ttest_ind(a, b, equal_var=True)
where:
a, b: the arrays of sample observations for the two groups
equal_var: if True (the default), perform the standard independent two sample t-test that assumes equal population variances; if False, perform Welch's t-test, which does not assume equal variances
Before we perform the test, we need to decide if we’ll assume the two populations have equal variances or not. As a rule of thumb, we can assume the populations have equal variances if the ratio of the larger sample variance to the smaller sample variance is less than 4:1.
# Find variance for each group
var1, var2 = np.var(CompOperating['Funds_in_million']), np.var(CompClosed['Funds_in_million'])
print(var1, var2)
10922.975787130521 34.469877948057864
VarianceRatio = var1/var2
VarianceRatio
316.8846667687709
# Let's use the unequal-variance case (Welch's t-test)
stats.ttest_ind(CompOperating['Funds_in_million'], CompClosed['Funds_in_million'], equal_var=False)
Ttest_indResult(statistic=2.6727214329249227, pvalue=0.007898898722704632)
Interpretation:
Because the p-value of our test (0.00789) is less than alpha = 0.05, we reject the null hypothesis of the test.
We do have sufficient evidence to say that there is a significant difference between the funds raised by the companies that are operating vs the companies that are closed.
# Make a copy of the original dataset
compxOriginal = compx.copy(deep=True)
compxOriginal.head()
| Startup | Product | Funding | Event | Result | OperatingState | |
|---|---|---|---|---|---|---|
| 0 | 2600Hz | 2600hz.com | NaN | Disrupt SF 2013 | Contestant | Operating |
| 1 | 3DLT | 3dlt.com | $630K | Disrupt NYC 2013 | Contestant | Closed |
| 2 | 3DPrinterOS | 3dprinteros.com | NaN | Disrupt SF 2016 | Contestant | Operating |
| 3 | 3Dprintler | 3dprintler.com | $1M | Disrupt NY 2016 | Audience choice | Operating |
| 4 | 42 Technologies | 42technologies.com | NaN | Disrupt NYC 2013 | Contestant | Operating |
compxOriginal.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 662 entries, 0 to 661
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   Startup         662 non-null    object
 1   Product         656 non-null    object
 2   Funding         448 non-null    object
 3   Event           662 non-null    object
 4   Result          662 non-null    object
 5   OperatingState  662 non-null    object
dtypes: object(6)
memory usage: 31.2+ KB
compxOriginal["Result"].value_counts()
Contestant         488
Finalist            84
Audience choice     41
Winner              26
Runner up           23
Name: Result, dtype: int64
plt.figure(figsize=(10,5))
ax = sns.countplot(x="Result", data=compxOriginal)
# Operating companies: winners vs contestants
winnerOp = compxOriginal[['Result','OperatingState']][(compxOriginal['OperatingState']=='Operating')
& (compxOriginal['Result']=='Winner')]
contestantOp = compxOriginal[['Result','OperatingState']][(compxOriginal['OperatingState']=='Operating')
& (compxOriginal['Result']=='Contestant')]
winnerOp.head()
| Result | OperatingState | |
|---|---|---|
| 13 | Winner | Operating |
| 127 | Winner | Operating |
| 166 | Winner | Operating |
| 209 | Winner | Operating |
| 244 | Winner | Operating |
winnerOp.describe()
| Result | OperatingState | |
|---|---|---|
| count | 19 | 19 |
| unique | 1 | 1 |
| top | Winner | Operating |
| freq | 19 | 19 |
contestantOp.head()
| Result | OperatingState | |
|---|---|---|
| 0 | Contestant | Operating |
| 2 | Contestant | Operating |
| 4 | Contestant | Operating |
| 7 | Contestant | Operating |
| 8 | Contestant | Operating |
contestantOp.describe()
| Result | OperatingState | |
|---|---|---|
| count | 332 | 332 |
| unique | 1 | 1 |
| top | Contestant | Operating |
| freq | 332 | 332 |
# Operating winner and contestant
winnerContestantOp = 19 + 332
winnerContestantOp
351
# Using the data from above analysis
winnerOpPercent, contestantOpPercent = 19/351, 332/351
print(winnerOpPercent, contestantOpPercent)
0.05413105413105413 0.9458689458689459
The two hypotheses for this particular two sample z-test are as follows:
H0: p1 = p2 (the two proportions are equal)
Ha: p1 ≠ p2 (the two proportions are not equal)
# Assuming independent samples, large sample sizes, and the hypothesized population proportion difference as zero.
from statsmodels.stats.proportion import proportions_ztest
count = np.array([19, 332])
nobs = np.array([351, 351])
stat, pval = proportions_ztest(count, nobs)
print('{0:0.3f}'.format(pval), '{0:0.3f}'.format(stat))
0.000 -23.627
Interpretation:
Because the p-value of our test (0.000) is less than alpha = 0.05, we reject the null hypothesis of the test.
We do have sufficient evidence to say that there is a significant difference between the proportions of operating companies in the winner and contestant classes.
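As a cross-check, the reported z statistic can be reproduced by hand from the pooled-proportion formula, using the same counts as above (19 winners and 332 contestants, each out of 351 operating companies):

```python
import numpy as np
from scipy import stats

# Manual two-proportion z-test with a pooled proportion, mirroring
# proportions_ztest on the counts from the analysis above.
x1, x2, n1, n2 = 19, 332, 351, 351
p1, p2 = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)                          # pooled proportion
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2)) # pooled standard error
z = (p1 - p2) / se                                      # about -23.627
p_value = 2 * stats.norm.sf(abs(z))                     # two-sided p-value
```

The hand-computed z matches the -23.627 reported by `proportions_ztest`, confirming the library's default pooled-variance formulation.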
# Using the str.contains string method, we can filter for Disrupt events held in 2013
CompDisrupt2013 = compxOriginal[compxOriginal['Event'].str.contains('Disrupt') & compxOriginal['Event'].str.contains('2013')]
CompDisrupt2013
| Startup | Product | Funding | Event | Result | OperatingState | |
|---|---|---|---|---|---|---|
| 0 | 2600Hz | 2600hz.com | NaN | Disrupt SF 2013 | Contestant | Operating |
| 1 | 3DLT | 3dlt.com | $630K | Disrupt NYC 2013 | Contestant | Closed |
| 4 | 42 Technologies | 42technologies.com | NaN | Disrupt NYC 2013 | Contestant | Operating |
| 30 | Ansa | ansa.com | NaN | Disrupt SF 2013 | Contestant | Operating |
| 33 | AppArchitect | apparchitect.com | NaN | Disrupt NYC 2013 | Contestant | Acquired |
| ... | ... | ... | ... | ... | ... | ... |
| 628 | Wanderio | wanderio.com | $275K | Disrupt EU 2013 | Contestant | Operating |
| 637 | What Now Travel | whatnowtravel.com | NaN | Disrupt EU 2013 | Contestant | Operating |
| 641 | Workspot | workspot.com | $15.8M | Disrupt NYC 2013 | Contestant | Operating |
| 656 | Zenefits | zenefits.com | $583.6M | Disrupt NYC 2013 | Finalist | Operating |
| 660 | Zula | zulaapp.com | $3.4M | Disrupt SF 2013 | Audience choice | Operating |
77 rows × 6 columns
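The two chained `str.contains` calls can also be expressed as a single regular expression; this sketch assumes the event name always places "Disrupt" before the year:

```python
import pandas as pd

# Single-regex equivalent of chaining two str.contains calls, shown on a
# small synthetic sample of event names.
events = pd.Series(["Disrupt SF 2013", "Disrupt NY 2016", "TC40 2007", "Disrupt EU 2013"])
mask = events.str.contains(r"Disrupt.*2013", regex=True)  # True for the 2013 Disrupt events
```

The chained form is more robust if the year could ever precede "Disrupt" in an event name, since the regex fixes their order.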